Abstract

Breakage-Fusion-Bridge (BFB) is a mechanism of genomic instability characterized 
by the joining and subsequent tearing apart of sister chromatids. When this process 
is repeated during multiple rounds of cell division, it leads to patterns of copy number 
increases of chromosomal segments as well as fold-back inversions where duplicated 
segments are arranged head-to-head. These structural variations can then drive tu- 
morigenesis. 

BFB can be observed in progress using cytogenetic techniques, but generally BFB 
must be inferred from data like microarrays or sequencing collected after BFB has 
ceased. Making correct inferences from this data is not straightforward, particularly 
given the complexity of some cancer genomes and BFB's ability to generate a wide 
range of rearrangement patterns. 

Here we present algorithms to aid the interpretation of evidence for BFB. We first 
pose the BFB count vector problem: given a chromosome segmentation and segment 
copy numbers, decide whether BFB can yield a chromosome with the given segment 
counts. We present the first linear-time algorithm for the problem, improving a pre- 
vious exponential-time algorithm. We then combine this algorithm with fold-back 
inversions to develop tests for BFB. We show that, contingent on assumptions about 
cancer genome evolution, count vectors and fold-back inversions are sufficient evidence 
for detecting BFB. We apply the presented techniques to paired-end sequencing data 
from pancreatic tumors and confirm a previous finding of BFB as well as identify a new 
chromosomal region likely rearranged by BFB cycles, demonstrating the practicality 
of our approach. 



1 Introduction 



Genomic instability allows cells to acquire the functional capabilities needed to become 
cancerous [1], so understanding the origin and operation of genomic instability is crucial 
to finding effective treatments for cancer. Numerous mechanisms of genomic instability 
have been proposed [2], including the faulty repair of double-stranded DNA breaks by re- 
combination or end-joining and polymerase hopping caused by replication fork collapse [3]. 
These mechanisms are generally not directly observable, so their elucidation requires the 
deciphering of often subtle clues after genomic instability has ceased. 

In contrast, the breakage-fusion-bridge (BFB) mechanism creates gross chromosomal 
abnormalities that can be seen in progress using methods that have been available for 
decades [4]. BFB begins when a chromosome loses a telomere (Figs. 1a, 1b). Then during 
replication, the two sister chromatids of the telomere-lacking chromosome fuse together 
(Figs. 1c, 1d). During anaphase, as the centromeres of the chromosome migrate to opposite 
ends of the cell (Fig. 1e), the fused chromatids are torn apart (Fig. 1f). Each daughter cell 
receives a chromosome missing a telomere, and the cycle can begin again. As this process 
repeats, it can lead to the rapid accumulation of amplifications and rearrangements that 
facilitates the transition to malignancy [5]. 

This process produces several plainly identifiable cytogenetic signatures such as anaphase 
bridges and dicentric chromosomes. However, as cancer genomics has shifted to high- 
throughput techniques, the signatures of BFB have become less clear. Methods like mi- 
croarrays and sequencing do not allow for direct observation of BFB; instead BFB is now 
similar to other mechanisms of instability in that it must be inferred by finding its footprint 
in complex data. 

Multiple groups have begun to address the problem of finding evidence for BFB in 
high-throughput data. For example, Bignell et al. found a pattern of inversions and 
exponentially increasing copy numbers "[bearing] all the architectural hallmarks at the 
sequence level" of BFB [6]. Kitada and Yamasaki found a pattern of copy counts and 
segment organization consistent with a particular set of BFB cycles [7]. Hillmer et al. 
used paired-end sequencing to find patterns of inversions and amplification explainable by 
BFB [8]. 

The procedures of these investigators, among others [9, 10, 11], share an element in 
common: they determine whether a particular observation is consistent with or could be 
explained by BFB. While this is helpful, it does not on its own allow one to infer whether 
or not BFB occurred. Indeed, in a previous work [12] we examined short patterns of copy 
number increases consisting of five or six chromosome segments. We found that most such 
patterns, whether produced by BFB or not, were consistent or nearly consistent with BFB. 
Thus, finding that such a pattern was consistent with BFB would only be weak evidence 
that it had been produced by BFB. This finding highlights the need for a rigorous and 
systematic approach to the interpretation of modern data for BFB in order to avoid being 
misled by the complexity of cancer genomes and the BFB mechanism itself. 
Here we present a framework for interpreting high-throughput data for signatures of 
BFB. We incorporate observations of breakpoints as well as copy numbers to create a 
scoring scheme for chromosomes. Through simulations, we find appropriate threshold 
scores for labeling a chromosome as having undergone BFB based on varying models of 
cancer genome evolution and tolerances for error. This framework complements the work 
of previous groups by not only finding breakpoint and copy number patterns consistent 
with BFB but also showing under what assumptions they are more likely to be observed 
if BFB occurred than if it did not. 

The key technical contribution that underlies our scoring scheme is a new, fast al- 
gorithm for determining if a given pattern of copy counts is consistent with BFB. This 
algorithm is related to a previously described algorithm [12] in that it takes advantage of 
a distinctive feature of BFB: when fused chromatids are torn apart, they may not tear at 
the site of fusion. This yields chromosomes with either a terminal deletion or a terminal 
inverted duplication. When a chromosome undergoes this process repeatedly, it results in 
particular patterns of copy number increases. The running time of the earlier algorithm 
grew exponentially with the amount of amplification and the number of segments in a copy 
number pattern. This greatly narrowed the scope of copy number patterns that could be 
investigated. This was particularly limiting because it appeared that copy number pat- 
terns with more segments would be more useful for identifying BFB, but these patterns 
could not be evaluated in a reasonable amount of time with the previous method. The new 
algorithm presented here is linear time and therefore allows complex copy number patterns 
to be checked in a trivial amount of time. 

We begin by describing the kinds of high-throughput data that can provide evidence for 
BFB. We then proceed to lay out some formalizations needed to precisely describe scoring 
methods of samples based on BFB evidence implied from such data. Next, we define related 
computational problems, followed by an outline of algorithms for these problems. In the 
results section, we detail the simulations we used to measure the performance of our scoring 
system for BFB. Based on simulation parameters, we find false and true positive rates for 
different BFB signatures. We apply our methods to two datasets. The first is copy number 
data from 746 cancer cell lines [13]. We find three chromosomes that have long copy 
number patterns consistent with BFB, but the false positive rates from our simulations 
suggest that these may be false discoveries. We also examine paired-end sequencing data 
from pancreatic cancers [14]. We find two chromosomes that likely have undergone BFB, 
one that was identified by the original publishers of the data and one novel finding. 

2 High-throughput evidence for BFB 

We consider two experimental sources for evidence for BFB: microarrays and sequencing. 
Microarrays allow for the estimation of the copy number of segments of a chromosome 
by measuring probe intensities [15]. Sequencing also yields copy number estimates by 
measuring depth of sequence coverage [16]. In addition, if the sequencing uses paired-end 
reads and is performed on the whole genome rather than, say, the exome, it can reveal 
genomic breakpoints where different portions of the genome are unexpectedly adjacent. 
This is generally the extent of evidence available from either technique. Sequencing does 
not allow for a full reconstruction of a rearranged chromosome, as the repetitive nature of 
the genome leads to multiple alternative assemblies. Neither method can resolve segment 
copy numbers by orientation, so copy numbers from both forward and reversed chromosome 
segments are summed. Nevertheless, BFB should leave its signature in both breakpoints 
and copy counts, and we examine each in turn. 

2.1 Breakpoints 

During BFB, the telomere-lacking sister chromatids are fused together. This causes the 
ends of the sister chromatids to become adjacent but in opposite orientations (see Fig. 1d). 
This adjacency is unlikely to be disrupted by subsequent BFB cycles and will remain in 
the final sequence as two duplicated segments arranged head-to-head. If the chromosome 
is paired-end sequenced, the rearrangement will appear as two ends that map very near 
each other but in opposite orientations. This type of rearrangement has been termed a 
"fold-back inversion" [14], and regions of a chromosome rearranged by BFB should have 
an enrichment of these fold-back inversions. Reliable indications for fold-back inversions 
may or may not be available, depending on the type of experiment and its intensity. 

2.2 Copy counts 

Each BFB cycle duplicates some telomeric portion of the chromosome undergoing BFB. 
These repeated duplications should lead to certain characteristic copy number patterns, 
which are the signature of BFB in copy number data. We would like to evaluate copy 
numbers observed from microarrays or sequencing and determine if the copy numbers con- 
tain the footprint of BFB. Previous groups have searched for such a footprint by manually 
inspecting copy number data and searching for a set of BFB cycles that could produce 
the observed copy numbers [6, 7]. This approach is challenging and labor intensive, but 
developing a more general approach turns out to be rather difficult. A key technical con- 
tribution of this paper is the development of efficient algorithms to evaluate copy counts 
for consistency with BFB. 

2.3 Formalizing BFB 

Creating an efficient method for evaluating copy numbers requires some formalization, so 
we begin with some definitions and basic results. 

We represent a chromosome as a string $ABC\ldots$, where each letter corresponds to a 
contiguous segment of the chromosome. For example, the string $ABCD$ would symbolize 
a chromosome arm composed of four segments, where $A$ is the segment nearest the 
centromere. More generally, we use $\sigma_l$ for the $l$-th segment in a chromosome. So, $ABCD$ 
could be written $\sigma_1\sigma_2\sigma_3\sigma_4$. A bar notation, $\bar{\sigma}$, is used to signify that a segment is reversed. 
Greek letters $\alpha$, $\beta$, $\gamma$, $\rho$ denote concatenations of chromosomal segments, and a bar will 
again mean that the concatenation is reversed. For example, if $\alpha = \sigma_1\sigma_2\bar{\sigma}_2$, then $\bar{\alpha} = \sigma_2\bar{\sigma}_2\bar{\sigma}_1$. 
An empty string is denoted by $\varepsilon$. 

Consider the following BFB cycle on a chromosome $X \bullet ABCD$, where $\bullet$ represents the 
centromere, $X$ is one chromosomal arm, and $ABCD$ is the other chromosomal arm, composed 
of four segments, which has lost a telomere. The cycle starts with the duplication of the 
chromosome into two sister chromatids and their fusion at the ends of the $D$ segments, 
generating the dicentric chromosome $X \bullet ABCD\bar{D}\bar{C}\bar{B}\bar{A} \bullet \bar{X}$. During anaphase, the two 
centromeres migrate to opposite poles of the cell and a breakage of the dicentric chromosome 
occurs between the centromeres, say between $\bar{D}$ and $\bar{C}$, providing one daughter cell with a 
chromosome with an inverted suffix, $X \bullet ABCD\bar{D}$, and another daughter cell with the trimmed 
chromosome $X \bullet ABC$ (chromosomes $\bar{C}\bar{B}\bar{A} \bullet \bar{X}$ and $X \bullet ABC$ are equivalent). The now amplified 
segment $D$ in the first daughter cell may confer some proliferative advantage, causing its 
descendants to increase in frequency. The daughter cells also lack a telomere on one 
chromosome arm and therefore may undergo additional BFB cycles. One possible subsequent 
cycle could, for example, cause an inverted duplication of the suffix $CD\bar{D}$, yielding the 
chromosome $X \bullet ABCD\bar{D}D\bar{D}\bar{C}$. As these BFB cycles continue, the count of segments on the 
modified chromosome arm can increase significantly. 
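The inverted-suffix duplication in this example can be written as a small string operation. The following is a minimal sketch, not the authors' code: it assumes a chromosome arm is encoded as a tuple of signed integers (`+l` for segment $\sigma_l$, `-l` for $\bar{\sigma}_l$), and the names `reverse`, `bfb_cycle`, and `show` are hypothetical helpers introduced only for illustration. It models only the daughter chromosome that keeps the inverted duplication.

```python
from typing import Tuple

# Assumed encoding: +l stands for segment sigma_l, -l for its reversed copy.
Segments = Tuple[int, ...]


def reverse(s: Segments) -> Segments:
    """Reverse a string of segments: flip the order and every orientation."""
    return tuple(-x for x in reversed(s))


def bfb_cycle(arm: Segments, suffix_len: int) -> Segments:
    """Append an inverted copy of the last `suffix_len` segments of the arm."""
    if not 1 <= suffix_len <= len(arm):
        raise ValueError("suffix must be non-empty and no longer than the arm")
    return arm + reverse(arm[-suffix_len:])


def show(s: Segments) -> str:
    """Render forward segments in upper case and reversed segments in lower case."""
    letters = []
    for x in s:
        base = ord("A") if x > 0 else ord("a")
        letters.append(chr(base + abs(x) - 1))
    return "".join(letters)


abcd = (1, 2, 3, 4)
first = bfb_cycle(abcd, 1)    # duplicates and inverts the suffix D
second = bfb_cycle(first, 3)  # duplicates and inverts the suffix C D d
print(show(first), show(second))
```

In this rendering a lower-case letter plays the role of the bar, so the two printed strings correspond to the two chromosomes obtained in the example above.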

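The notation $\alpha \xrightarrow{\mathrm{BFB}} \beta$ will be used for indicating that the string $\beta$ can be obtained by 
applying zero or more BFB cycles to the string $\alpha$, as formally described in Definition 1. 

Definition 1 For two strings $\alpha$, $\beta$, say that $\alpha \xrightarrow{\mathrm{BFB}} \beta$ if $\beta = \alpha$, or $\alpha = \rho\gamma$ for some strings 
$\rho$, $\gamma$ such that $\gamma \neq \varepsilon$, and $\rho\gamma\bar{\gamma} \xrightarrow{\mathrm{BFB}} \beta$. 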

We say that $\beta$ is an $l$-BFB string if, for some consecutive chromosomal region $\alpha = 
\sigma_l\sigma_{l+1}\ldots$ starting at the $l$-th segment $\sigma_l$, $\alpha \xrightarrow{\mathrm{BFB}} \beta$. Say that $\beta$ is a BFB string if it is an 
$l$-BFB string for some $l$. As examples, $CDE = \sigma_3\sigma_4\sigma_5$ is a 3-BFB string, and so are $CDE\bar{E}$ 
and $CDE\bar{E}E\bar{E}\bar{D}$. The empty string $\varepsilon$ is considered an $l$-BFB string for every integer $l > 0$. 

Denote by $n(\alpha) = [n_1, n_2, \ldots, n_k]$ the count vector of $\alpha$, where $\alpha$ represents a modified 
chromosomal arm $\sigma_1\sigma_2\ldots\sigma_k$ with $k$ segments, and $n_l$ is the count of occurrences (or copy 
number) of $\sigma_l$ and $\bar{\sigma}_l$ in $\alpha$. For example, for $\alpha = BCD\bar{D}\bar{C}C$, $n(\alpha) = [0, 1, 3, 2]$. Say that a 
vector $n$ is a BFB count vector if there exists some 1-BFB string $\alpha$ such that $n = n(\alpha)$. 
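To make the definitions concrete, the sketch below computes a count vector and checks by exhaustive search whether a small vector is a BFB count vector. This is a brute-force illustration that follows Definition 1 literally; it takes exponential time and is not the linear-time algorithm of this paper. The signed-integer encoding (`+l` for $\sigma_l$, `-l` for $\bar{\sigma}_l$) and the function names are assumptions made for the example.

```python
from collections import deque
from typing import List, Sequence, Tuple

# Assumed encoding: +l stands for segment sigma_l, -l for its reversed copy.
Segments = Tuple[int, ...]


def count_vector(s: Segments, k: int) -> List[int]:
    """Count occurrences of each segment 1..k, ignoring orientation."""
    counts = [0] * k
    for x in s:
        counts[abs(x) - 1] += 1
    return counts


def is_bfb_count_vector_bruteforce(n: Sequence[int]) -> bool:
    """Search all 1-BFB strings whose counts never exceed the target.

    Every BFB cycle only adds segments, so any string that already
    exceeds the target in some position can be discarded.
    """
    target = list(n)
    k = len(target)
    for m in range(k + 1):
        start = tuple(range(1, m + 1))  # unrearranged region sigma_1..sigma_m
        seen = {start}
        queue = deque([start])
        while queue:
            s = queue.popleft()
            if count_vector(s, k) == target:
                return True
            for j in range(1, len(s) + 1):
                # one BFB cycle: append the inverted copy of the last j segments
                t = s + tuple(-x for x in reversed(s[-j:]))
                if t in seen:
                    continue
                if all(a <= b for a, b in zip(count_vector(t, k), target)):
                    seen.add(t)
                    queue.append(t)
    return False


# B C D d c C, the example string given for the count vector definition
print(count_vector((2, 3, 4, -4, -3, 3), 4))
print(is_bfb_count_vector_bruteforce([1, 3, 2]))
print(is_bfb_count_vector_bruteforce([2, 1]))
```

The search is only practical for short vectors with small counts, which is the limitation that motivates a faster algorithm.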

2.4 Handling experimental imprecision 

Experimental methods do not provide the precise and accurate copy number of a given 
chromosome segment. Instead, some measurement error is expected. Moreover, in a cancer 
genome, it is plausible that a region undergoing BFB may also be rearranged by other 
mechanisms. So when we evaluate a count vector for consistency with BFB, we must also 
consider whether the count vector is "nearly" consistent with BFB. 

For this, we define a distance measure $\delta$ between count vectors, where $\delta(n, n')$ reflects 
a penalty for assuming that the real copy counts are $n'$ while the measured counts are 
$n$. We have implemented such a distance measure based on the Poisson likelihood of the 
observation, as follows. Let $\Pr(n_i \mid n'_i) = \frac{{n'_i}^{n_i} e^{-n'_i}}{n_i!}$ be the Poisson probability of measuring a 
copy number $n_i$, given that the segment's true copy number is $n'_i$. Assuming measurement 
errors are independent, the probability of measuring a count vector $n = [n_1, n_2, \ldots, n_k]$, 
where the true counts are $n' = [n'_1, n'_2, \ldots, n'_k]$, is given by 

$$\Pr(n \mid n') = \prod_{1 \le i \le k} \Pr(n_i \mid n'_i).$$

Define the distance of $n$ from $n'$ by 

$$\delta(n, n') = 1 - \frac{\Pr(n \mid n')}{\Pr(n' \mid n')}.$$

For every pair of count vectors $n$ and $n'$ of the same length, $0 \le \delta(n, n') < 1$, with values 
closer to 0 indicating greater similarity between $n$ and $n'$. 
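The distance can be evaluated directly from the Poisson formula. The sketch below is one possible implementation under the stated independence assumption; the function names are hypothetical, and the handling of a true count of zero (probability 1 for observing 0, and 0 otherwise) is an assumption added so that the function is defined for all non-negative inputs.

```python
import math
from typing import Sequence


def poisson_pmf(observed: int, mean: float) -> float:
    """Poisson probability of `observed` given the mean (the true copy number)."""
    if mean == 0:
        # Assumed convention: a segment with true count 0 is always observed as 0.
        return 1.0 if observed == 0 else 0.0
    log_p = observed * math.log(mean) - mean - math.lgamma(observed + 1)
    return math.exp(log_p)


def likelihood(observed: Sequence[int], true: Sequence[int]) -> float:
    """Product of per-segment Poisson probabilities (independent errors)."""
    if len(observed) != len(true):
        raise ValueError("count vectors must have the same length")
    p = 1.0
    for n_i, t_i in zip(observed, true):
        p *= poisson_pmf(n_i, t_i)
    return p


def distance(observed: Sequence[int], true: Sequence[int]) -> float:
    """delta(n, n') = 1 - Pr(n | n') / Pr(n' | n')."""
    return 1.0 - likelihood(observed, true) / likelihood(true, true)


print(distance([1, 3, 2], [1, 3, 2]))
print(distance([1, 4, 2], [1, 3, 2]))
```

For very long vectors the product of probabilities can underflow; working with sums of log-probabilities would be a natural refinement.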



2.5 The BFB Count Vector Problem 

With these definitions, we can now precisely pose a set of problems that need to be solved 
to evaluate copy number patterns for consistency with BFB: 

BFB count vector problem variants 
Input: a count vector $n = [n_1, n_2, \ldots, n_k]$. 

• The decision variant: decide if $n$ is a BFB count vector. 

• The search variant: if $n$ is a BFB count vector, find a BFB string $\alpha$ such that 
$n = n(\alpha)$. 

• The distance variant: identify a BFB count vector $n'$ such that $\delta(n, n')$ is minimized. 
Output $\delta(n, n')$. 



3 Outline of the BFB Count Vector Algorithms 

We defer the full details of the algorithms we have developed to the accompanying Support- 
ing Information (SI) document, presenting here only essential properties of BFB strings 
and some intuition of how to incorporate these properties in algorithms for BFB count 
vector problems. We focus on the search variant of the problem, where the goal of the 
algorithm is to output a BFB string a consistent with the input counts, if such a string 
exists. 
3.1 Properties of BFB palindromes 

Call an $l$-BFB string $\beta$ of the form $\beta = \alpha\bar{\alpha}$ an $l$-BFB palindrome¹. For an $l$-BFB string 
$\alpha$, the string $\beta = \alpha\bar{\alpha}$ is an $l$-BFB palindrome by definition (choosing $\rho = \varepsilon$ and $\gamma = \alpha$ in 
Definition 1). In [12], it was shown that every prefix of a BFB string is itself a BFB string; 
thus, for an $l$-BFB palindrome $\beta = \alpha\bar{\alpha}$, the prefix $\alpha$ of $\beta$ is also an $l$-BFB string. Hence, 
it follows that $\alpha$ is an $l$-BFB string if and only if $\beta = \alpha\bar{\alpha}$ is an $l$-BFB palindrome. For a 
BFB string $\alpha$ with $n(\alpha) = [n_1, n_2, \ldots, n_k]$ and a corresponding BFB palindrome $\beta = \alpha\bar{\alpha}$, 
we have that $n(\beta) = 2n(\alpha) = [2n_1, 2n_2, \ldots, 2n_k]$. Thus, a count vector $n$ is a BFB count 
vector if and only if there is a 1-BFB palindrome $\beta$ such that $n(\beta) = 2n$. Considering BFB 
palindromes instead of BFB strings will facilitate the algorithm description. 

Define an $l$-block as a palindrome of the form $\beta = \sigma_l\beta'\bar{\sigma}_l$, where $\beta'$ is an $(l+1)$-BFB 
palindrome. For example, from the 4-BFB palindromes $\beta'_1 = DE\bar{E}\bar{D}DE\bar{E}\bar{D}$ and $\beta'_2 = \varepsilon$, we 
can produce the 3-blocks $\beta_1 = \sigma_3\beta'_1\bar{\sigma}_3 = CDE\bar{E}\bar{D}DE\bar{E}\bar{D}\bar{C}$ and $\beta_2 = \sigma_3\beta'_2\bar{\sigma}_3 = C\bar{C}$. It may 
be asserted that an $l$-block is a special case of an $l$-BFB palindrome. Next, we show how 
$l$-BFB palindromes may be decomposed into $l$-block substrings. 

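For a string $\alpha \neq \varepsilon$, denote by $\mathrm{top}(\alpha)$ the maximum integer $t$ such that $\sigma_t$ or $\bar{\sigma}_t$ occurs in 
$\alpha$, and define $\mathrm{top}(\varepsilon) = 0$. For two strings $\alpha$ and $\beta$, say that $\alpha \le^t \beta$ if $\mathrm{top}(\alpha) \le \mathrm{top}(\beta)$, and 
that $\alpha <^t \beta$ if $\mathrm{top}(\alpha) < \mathrm{top}(\beta)$. For example, for $\alpha = AB$ and $\beta = ABCD\bar{D}\bar{C}$, $\mathrm{top}(\alpha) = 2$ 
and $\mathrm{top}(\beta) = 4$, therefore $\alpha <^t \beta$. 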
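As a quick check of the top function and the two orderings it induces, the following sketch evaluates them on the example above. It assumes strings are encoded as tuples of signed integers (`+l` for $\sigma_l$, `-l` for $\bar{\sigma}_l$); the function names are hypothetical.

```python
from typing import Tuple

# Assumed encoding: +l stands for segment sigma_l, -l for its reversed copy.
Segments = Tuple[int, ...]


def top(s: Segments) -> int:
    """Largest segment index occurring in s in either orientation; 0 if s is empty."""
    return max((abs(x) for x in s), default=0)


def leq_t(a: Segments, b: Segments) -> bool:
    """Non-strict ordering: top(a) <= top(b)."""
    return top(a) <= top(b)


def lt_t(a: Segments, b: Segments) -> bool:
    """Strict ordering: top(a) < top(b)."""
    return top(a) < top(b)


alpha = (1, 2)                # A B
beta = (1, 2, 3, 4, -4, -3)   # A B C D d c
print(top(alpha), top(beta), lt_t(alpha, beta))
```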

Definition 2 A string $\alpha$ is a convexed $l$-palindrome if $\alpha = \varepsilon$, or $\alpha = \gamma\beta\bar{\gamma}$ such that $\gamma$ is 
a convexed $l$-palindrome, $\beta$ is an $l$-BFB palindrome, and $\gamma <^t \beta$. 

While every $l$-BFB palindrome $\alpha$ is also a convexed $l$-palindrome (since $\alpha = \varepsilon\alpha\bar{\varepsilon}$), not 
every convexed $l$-palindrome is a valid BFB string. For example, $\alpha = A\bar{A}AB\bar{B}\bar{A}A\bar{A}$ is 
a convexed 1-palindrome (choosing $\gamma = A\bar{A}$, $\beta = AB\bar{B}\bar{A}$), yet it is not a 1-BFB string. 
Instead, we have the following claim, proven in the SI document: 

Claim 1 A string $\alpha$ is an $l$-BFB palindrome if and only if $\alpha = \varepsilon$, $\alpha$ is an $l$-block, or 
$\alpha = \beta\gamma\beta$, such that $\beta$ is an $l$-BFB palindrome, $\gamma$ is a convexed $l$-palindrome, and $\gamma \le^t \beta$. 

From Definition 2 and Claim 1, it follows that an $l$-BFB palindrome $\alpha$ is a palindromic 
concatenation of $l$-blocks. In addition, for the total count $2n_l$ of $\sigma_l$ and $\bar{\sigma}_l$ in $\alpha$, $\alpha$ contains 
exactly $n_l$ $l$-blocks, where each block contains one occurrence of $\sigma_l$ and one occurrence of 
$\bar{\sigma}_l$. When $n_l$ is even, $\alpha$ is of the form 
$\alpha = \beta_1\beta_2 \cdots \beta_{\frac{n_l}{2}-1}\beta_{\frac{n_l}{2}}\beta_{\frac{n_l}{2}}\beta_{\frac{n_l}{2}-1} \cdots \beta_2\beta_1$, where each $\beta_i$ is 
an $l$-block. When $n_l$ is odd, $\alpha$ is of the form 
$\alpha = \beta_1\beta_2 \cdots \beta_{\frac{n_l-1}{2}}\beta_{\frac{n_l+1}{2}}\beta_{\frac{n_l-1}{2}} \cdots \beta_2\beta_1$. In 
the latter case, say that $\beta_{\frac{n_l+1}{2}}$ is the center of $\alpha$, whereas in the former case say that the 
center of $\alpha$ is $\varepsilon$. Note that every $l$-block $\beta$ appearing in $\alpha$ and different from its center 
occurs an even number of times in $\alpha$. If the center of $\alpha$ is an $l$-block, this particular block 
is the only block which appears an odd number of times in $\alpha$, while if it is an empty string 
then no block appears an odd number of times in $\alpha$. 

¹ We assume that genomic segments $\sigma$ satisfy $\sigma \neq \bar{\sigma}$, therefore strings of the form $\alpha\sigma\bar{\alpha}$ will not be 
considered palindromes. 

Now, let $\beta$ be a 1-BFB palindrome with a count vector $n(\beta) = 2n = [2n_1, 2n_2, \ldots, 2n_k]$. 
It is helpful to depict $\beta$ so that each character $\sigma_l$ is at its own layer $l$, increasing with 
increasing $l$, as shown in Fig. 2a. As $\beta$ is a concatenation of 1-blocks, we can consider the 
collection $B^1 = \{m_1\beta_1, m_2\beta_2, \ldots, m_q\beta_q\}$ of these blocks, where each count $m_i$ is the number 
of distinct repeats of $\beta_i$ in $\beta$. For example, for the string in Fig. 2a, $B^1 = \{2\beta_1, \beta_2, 2\beta_3, 4\beta_4\}$, 
where $|B^1| = n_1 = 9$, and $\beta_2$ is the center of $\beta$. Masking from strings in $B^1$ all occurrences 
of $A$ and $\bar{A}$, each 1-block $\beta_i = A\beta'_i\bar{A}$ in $B^1$ becomes a 2-BFB palindrome $\beta'_i$. Such 2-BFB 
palindromes may be further decomposed into 2-blocks, yielding a 2-block collection $B^2$ 
(in Fig. 2b, $B^2 = \{2\beta_5, \beta_6, 2\beta_7\}$, where $|B^2| = n_2 = 5$). In general, for each $1 \le l \le k$, 
masking in $\beta$ all letters $\sigma_r$ and $\bar{\sigma}_r$ such that $r < l$ defines a corresponding collection $B^l$ of 
$l$-block substrings of $\beta$. Each collection $B^l$ contains exactly $n_l$ elements, as each $l$-block in 
the collection contains exactly two out of the $2n_l$ occurrences of $\sigma_l$ in the string (where one 
occurrence is reversed). The collection $B^{l+1}$ is obtained from $B^l$ by masking occurrences of 
$\sigma_l$ and $\bar{\sigma}_l$ from the elements in $B^l$, and decomposing the obtained $(l+1)$-BFB palindromes 
into $(l+1)$-blocks. We may define $B^{k+1} = \emptyset$ (where $\emptyset$ denotes an empty collection), 
since after masking in $\beta$ all segments $\sigma_1, \ldots, \sigma_k$ we are left with an empty collection of 
$(k+1)$-blocks. 

The algorithm we describe for the search variant of the BFB count vector problem 
exploits the above described property of BFB palindromes. Given a count vector $n = 
[n_1, n_2, \ldots, n_k]$, the algorithm iteratively processes the counts in the vector one by one, 
from $n_k$ down to $n_1$, producing a series of collections $B^k, B^{k-1}, \ldots, B^1$. Starting with 
$B^{k+1} = \emptyset$, each collection $B^l$ in the series is obtained from the preceding collection $B^{l+1}$ 
in a two-step procedure. First, $(l+1)$-blocks from $B^{l+1}$ are concatenated in a manner 
that produces an $(l+1)$-BFB palindrome collection $B'$ of size $n_l$ ($B'$ may contain empty 
strings, which can be thought of as concatenations of zero elements from $B^{l+1}$). Then, $B^l$ 
is obtained by "wrapping" each element $\beta' \in B'$ with a pair of $\sigma_l$ characters to become an 
$l$-block $\beta = \sigma_l\beta'\bar{\sigma}_l$. We will refer to the first step in this procedure as collection folding, 
and to the second step as collection wrapping. For example, in Fig. 2d, the elements in 
$B^4 = \{4\beta_{10}\}$ are folded to form a 4-BFB palindrome collection $B' = \{2\beta_{10}\beta_{10}, \varepsilon\}$ of size $n_3 = 3$. 
After wrapping each element of $B'$ by $C$ to the left and $\bar{C}$ to the right, we get the 3-block 
collection $B^3 = \{2C\beta_{10}\beta_{10}\bar{C}, C\bar{C}\} = \{2\beta_8, \beta_9\}$. Algorithm SEARCH-BFB($n$) in Fig. 3 
gives the pseudo-code for the described procedure, excluding the implementation of the 
folding phase, which is kept abstract here. We next discuss some restrictions over the 
folding procedure, and point out that greedy folding is nontrivial. Nevertheless, in the SI 
document we show an explicit implementation of a folding procedure, which guarantees 
that the search algorithm finds a BFB string provided that the input is a valid BFB count 
vector. 
3.2 Required conditions for folding 

Recall that the input of the folding procedure is an $l$-block collection $B$ and an integer $n$, 
and the procedure should concatenate all strings in $B$ in some manner to produce an $l$-BFB 
palindrome collection $B'$ of size $n$. Since both $l$-blocks and empty strings are special cases 
of $l$-BFB palindromes, when $n \ge |B|$ it is always possible to obtain $B'$ by simply adding 
$n - |B|$ empty strings to $B$. Nevertheless, when $n < |B|$, there are instances for which no 
valid folding exists, as shown next. 

For a pair of collections $B$ and $B'$, $B + B'$ is the collection containing all elements in 
$B$ and $B'$. When $B'' = B + B'$, we say that $B = B'' - B'$ (note that $B'' - B'$ is well 
defined only when $B''$ contains $B'$). For some (possibly rational) number $x > 0$, denote 
by $xB$ the collection $\{\lfloor xm_1 \rfloor\beta_1, \lfloor xm_2 \rfloor\beta_2, \ldots, \lfloor xm_q \rfloor\beta_q\}$. The operation $\mathrm{mod2}(B)$ yields 
the sub-collection of $B$ containing a single copy of each distinct element $\beta$ with an odd 
count in $B$. For example, for $B = \{2\beta_1, \beta_2, 5\beta_3, 6\beta_4\}$, $\mathrm{mod2}(B) = \{\beta_2, \beta_3\}$. Observe that 
$B = \mathrm{mod2}(B) + 2\left(\frac{1}{2}B\right)$. 
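These collection operations map naturally onto multisets. The sketch below is an illustration that assumes a collection is stored as a `collections.Counter` keyed by block (here by placeholder labels such as `"b1"`); `scale` and `mod2` are hypothetical helper names.

```python
import math
from collections import Counter


def scale(B: Counter, x: float) -> Counter:
    """The collection xB: every multiplicity m becomes floor(x * m)."""
    out = Counter()
    for element, m in B.items():
        c = math.floor(x * m)
        if c > 0:
            out[element] = c
    return out


def mod2(B: Counter) -> Counter:
    """One copy of each distinct element that has an odd multiplicity in B."""
    return Counter({element: 1 for element, m in B.items() if m % 2 == 1})


# The example from the text, with placeholder labels standing for the blocks.
B = Counter({"b1": 2, "b2": 1, "b3": 5, "b4": 6})
print(mod2(B))
print(mod2(B) + scale(scale(B, 0.5), 2) == B)
```

`Counter` addition sums multiplicities, which matches the $B + B'$ operation defined above.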

Claim 2 Let $B$ be an $l$-BFB palindrome collection such that $\mathrm{mod2}(B) = \emptyset$. Then, it is 
possible to concatenate all elements in $B$ to obtain a single $l$-BFB palindrome. 

Proof. By induction on the size of $B$. By definition, $\mathrm{mod2}(B) = \emptyset$ implies that the counts 
of all distinct elements in $B$ are even. When $B = \emptyset$, the concatenation of all elements in 
$B$ yields an empty string $\varepsilon$, which is an $l$-BFB palindrome as required. Otherwise, assume 
the claim holds for all collections $B'$ smaller than $B$. Let $\beta \in B$ be an element such that 
for every $\beta' \in B$, $\mathrm{top}(\beta') \le \mathrm{top}(\beta)$, and let $B' = B - \{2\beta\}$. Note that $\mathrm{mod2}(B') = \emptyset$ (since 
the count parity is identical for every element in both $B$ and $B'$), and from the inductive 
assumption it is possible to concatenate all elements in $B'$ into a single $l$-BFB palindrome 
$\alpha'$. From Claim 1, the string $\alpha = \beta\alpha'\beta$ is an $l$-BFB palindrome, obtained by concatenating 
all elements in $B$. □ 
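The induction in this proof translates into a simple construction: place the elements with the largest top outermost and nest the rest inside. The sketch below follows that ordering. It assumes strings are tuples of signed integers (`+l` for $\sigma_l$, `-l` for $\bar{\sigma}_l$) and that every element of the collection is itself a palindrome; the function names are hypothetical, and the code does not verify that the inputs are $l$-BFB palindromes.

```python
from collections import Counter
from typing import List, Tuple

# Assumed encoding: +l stands for segment sigma_l, -l for its reversed copy.
Segments = Tuple[int, ...]


def reverse(s: Segments) -> Segments:
    """Reverse a string of segments: flip the order and every orientation."""
    return tuple(-x for x in reversed(s))


def top(s: Segments) -> int:
    """Largest segment index occurring in s; 0 for the empty string."""
    return max((abs(x) for x in s), default=0)


def concatenate_even_collection(B: Counter) -> Segments:
    """Concatenate a collection whose multiplicities are all even.

    Mirrors the proof: the element with the largest top goes outermost,
    and the remaining elements are nested inside it.
    """
    if any(m % 2 for m in B.values()):
        raise ValueError("every element must occur an even number of times")
    half: List[Segments] = []
    for element, m in B.items():
        if reverse(element) != element:
            raise ValueError("elements are expected to be palindromes")
        half.extend([element] * (m // 2))
    half.sort(key=top, reverse=True)  # largest top first, i.e. outermost
    left = tuple(x for element in half for x in element)
    return left + reverse(left)


b3 = (2, -2)                  # B b
b4 = (2, 3, -3, 3, -3, -2)    # B C c C c b
result = concatenate_even_collection(Counter({b3: 2, b4: 2}))
print(result == b4 + b3 + b3 + b4)
```

Because each element equals its own reversal, mirroring the left half reproduces the same elements in reverse order.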

Claim 3 Let $B$ be an $l$-block collection. There is a folding $B'$ of $B$ such that $|B'| = 
|\mathrm{mod2}(B)| + 1$. 

Proof. Recall that $B = \mathrm{mod2}(B) + 2\left(\frac{1}{2}B\right)$. Since all element counts in the collection 
$2\left(\frac{1}{2}B\right)$ are even, $\mathrm{mod2}\left(2\left(\frac{1}{2}B\right)\right) = \emptyset$, and from Claim 2 it is possible to concatenate all 
elements in $2\left(\frac{1}{2}B\right)$ into a single $l$-BFB palindrome $\alpha$. Thus, the collection $B' = \mathrm{mod2}(B) + 
\{\alpha\}$ is a folding of $B$ of size $|\mathrm{mod2}(B)| + 1$. □ 

Claim 4 For every folding $B'$ of an $l$-block collection $B$, $|\mathrm{mod2}(B')| \ge |\mathrm{mod2}(B)|$. 

Proof. Let $\beta \in \mathrm{mod2}(B)$ be an $l$-block repeating an odd number of times $m$ in $B$. 
Therefore, $\beta$ appears as a center of at least one element $\beta'$ that occurs an odd number of 
times in $B'$ (otherwise, $\beta$ has an even number of distinct repeats as a substring of elements 
in $B'$, in contradiction to the fact that $m$ is odd). Hence, for each $\beta \in \mathrm{mod2}(B)$ there is 
a corresponding unique element $\beta' \in \mathrm{mod2}(B')$, and so $|\mathrm{mod2}(B')| \ge |\mathrm{mod2}(B)|$. □ 

The SEARCH-BFB($n$) algorithm described in Fig. 3 tries in each iteration $l$ to fold the 
block collection $B^{l+1}$ obtained in the previous iteration into an $(l+1)$-BFB palindrome 
collection of size $n_l$. When $n_l \ge |\mathrm{mod2}(B^{l+1})| + 1$, there always exists a folding as required: 
$B^{l+1}$ may be folded into a collection of size $|\mathrm{mod2}(B^{l+1})| + 1$ due to Claim 3, and additional 
$n_l - |\mathrm{mod2}(B^{l+1})| - 1$ empty strings may be added in order to get a folding of size $n_l$. On 
the other hand, when $n_l < |\mathrm{mod2}(B^{l+1})|$, no folding as required exists, due to Claim 4. 
In the remaining case of $n_l = |\mathrm{mod2}(B^{l+1})|$, the existence of an $n_l$-size folding of $B^{l+1}$ 
depends on the element composition of $B^{l+1}$, as exemplified next. 
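The three cases can be summarized as a small classification on the number of odd-count blocks. The sketch below only encodes the case analysis from Claims 3 and 4; it does not perform a folding and does not resolve the borderline case. Collections are assumed to be `collections.Counter` objects keyed by placeholder labels, and `folding_feasibility` is a hypothetical name.

```python
from collections import Counter


def folding_feasibility(B: Counter, n: int) -> str:
    """Classify whether B can be folded into a collection of size n.

    Returns "always" when n >= |mod2(B)| + 1 (Claim 3 plus empty strings),
    "never" when n < |mod2(B)| (Claim 4), and "depends" when n == |mod2(B)|,
    where the answer depends on the composition of B.
    """
    odd = sum(1 for m in B.values() if m % 2 == 1)  # |mod2(B)|
    if n >= odd + 1:
        return "always"
    if n < odd:
        return "never"
    return "depends"


print(folding_feasibility(Counter({"b1": 2}), 3))
print(folding_feasibility(Counter({"b2": 2, "b3": 1}), 1))
print(folding_feasibility(Counter({"x": 1, "y": 3, "z": 5}), 2))
```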

Consider the run of Algorithm SEARCH-BFB($n$) over the input count vector $n = 
[1, 3, 2]$. Here, $k = 3$, and the algorithm starts by initializing the collection $B^4 = \emptyset$. In the 
first loop iteration $l = 3$, and the algorithm first tries to fold the empty collection $B^4$ into a 
4-BFB palindrome collection containing $n_3 = 2$ elements. Since there are no elements in $B^4$ 
to concatenate, the only way to perform this folding is by adding to $B^4$ two empty strings, 
yielding the collection $B' = \{2\varepsilon\}$, which after wrapping becomes $B^3 = \{2C\bar{C}\} = \{2\beta_1\}$. 
In the next iteration $l = 2$, and $B^3$ should be folded into a collection $B'$ of size $n_2 = 3$. 
Among the possibilities to perform this folding are the following: $B'_a = \{2\beta_1, \varepsilon\}$ and $B'_b = 
\{\beta_1\beta_1, 2\varepsilon\}$, which after wrapping become $B^2_a = \{2B\beta_1\bar{B}, B\bar{B}\} = \{2\beta_2, \beta_3\}$ and $B^2_b = 
\{B\beta_1\beta_1\bar{B}, 2B\bar{B}\} = \{\beta_4, 2\beta_3\}$, respectively. Note that $|\mathrm{mod2}(B^2_a)| = |\mathrm{mod2}(B^2_b)| = 1$. 
Nevertheless, it is possible to fold $B^2_a$ in the next iteration into the collection $\{\beta_2\beta_3\beta_2\}$ of 
size $n_1 = 1$, while $B^2_b$ cannot be folded into such a collection. The reason is that the only 
concatenation of all elements in $B^2_b$ into a single palindrome is the concatenation $\beta_3\beta_4\beta_3$, 
but since $\mathrm{top}(\beta_4) = \mathrm{top}(BC\bar{C}C\bar{C}\bar{B}) = 3 > 2 = \mathrm{top}(B\bar{B}) = \mathrm{top}(\beta_3)$, Claim 1 implies that 
this concatenation is not a valid BFB palindrome. 

In the SI document, we define a property called the signature of a collection, and show 
how the exact minimum folding size depends on this signature. We also show how to fold 
a collection in a manner that optimizes this signature, and guarantees for valid BFB count 
vector inputs that the search algorithm finds an admitting BFB string. 



4 Running time 

For a count vector $n = [n_1, \ldots, n_k]$, let $N = \sum_{1 \le i \le k} n_i$ be the number of segments in a string 
corresponding to $n$. Let $\tilde{N} = \sum_{1 \le i \le k} \log n_i$ denote a number proportional to the number of 
bits in the representation of $n$, assuming each count $n_i$ is represented by $O(\log n_i)$ bits. In 
the SI document, we complete the implementation details of algorithms for the decision, 
search, and distance variants of the BFB count vector problem, and show these algorithms 
have the asymptotic running times of $O(\tilde{N})$ (bit operations), $O(N)$, and $O(N \log N)$ (under 
some realistic assumptions), respectively. For the decision and search variants, these running 
times are optimal, being linear in the input (for the decision variant) or output (for 
the search variant) lengths. 

In practical terms, this has a significant effect on our ability to evaluate copy number 
signatures of BFB when compared to the previous exponential-time algorithm [12]. To 
determine if a count vector consistent with BFB is in fact strong evidence for BFB, we 
have to check many count vectors. Analyzing the simulations we explain below required 
testing tens of millions of different count vectors, so even a small improvement in running 
time can have a large impact on the scope of analysis we can perform. 

The running time improvement with the new algorithm, however, is not small. For example, 
a count vector that took 9 seconds with the previous algorithm can be processed by the 
new algorithm in $1.2 \times 10^{-5}$ seconds. A count vector that needed 148 seconds with the old 
algorithm now completes in $1.9 \times 10^{-5}$ seconds. A count vector that was abandoned after 
30 hours with the old algorithm now takes only $8.1 \times 10^{-6}$ seconds. Thus, the improvement 
in running time is not of merely theoretical interest. The earlier algorithm did not allow 
a thorough study of longer count vectors, while with the new algorithm such a study is 
possible. 

5 Detecting Signatures of BFB 

We can now describe the two features we will use to determine if a chromosome has under- 
gone BFB. The first feature is based on the fold-back inversions that BFB produces. For a 
given region, we can find all the breakpoints identified by sequencing and determine what 
proportion are fold-back inversions. We call this the fold-back fraction. The second feature 
relies on our algorithm that solves the BFB count vector problems we have posed. For a 
given contiguous pattern of copy counts, that is, a count vector, we can find the distance 
to the nearest count vector that could be produced by BFB using the distance metric we 
defined above. We call this the count vector distance. For a particular count vector, we 
define a score s that combines these two features: 

$$s = \lambda\delta + (1 - \lambda)(1 - f) \qquad (1)$$

Here, $f$ refers to the fold-back fraction, $\delta$ refers to the count vector distance, and $\lambda$ refers 
to the weight we give to the count vector distance versus the fold-back fraction when 
calculating the score. When $\lambda = 1$, we are only looking at count vector distance, whereas 
when $\lambda = 0$, we are only using fold-back fraction and ignoring the count vectors. 
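The score is a weighted combination of the two features and is straightforward to compute once both are available. The sketch below is illustrative only: the function names are hypothetical, and returning a fold-back fraction of 0 for a region with no breakpoints is an assumption, since the text does not specify that case.

```python
def fold_back_fraction(num_fold_backs: int, num_breakpoints: int) -> float:
    """Proportion of breakpoints in a region that are fold-back inversions."""
    if num_fold_backs < 0 or num_fold_backs > num_breakpoints:
        raise ValueError("fold-back count must lie between 0 and the breakpoint count")
    if num_breakpoints == 0:
        return 0.0  # assumed convention for regions without any breakpoints
    return num_fold_backs / num_breakpoints


def bfb_score(distance: float, fold_back: float, weight: float) -> float:
    """Score s = weight * distance + (1 - weight) * (1 - fold_back).

    `distance` is the count vector distance, `fold_back` the fold-back
    fraction, and `weight` the relative weight of the distance term.
    Lower scores indicate stronger evidence for BFB.
    """
    for name, value in (("distance", distance), ("fold_back", fold_back), ("weight", weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return weight * distance + (1.0 - weight) * (1.0 - fold_back)


f = fold_back_fraction(3, 4)
print(bfb_score(0.25, f, 1.0))  # count vector distance only
print(bfb_score(0.25, f, 0.0))  # fold-back fraction only
print(bfb_score(0.25, f, 0.5))  # equal weighting
```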

6 Results 

To determine whether our two proposed features could identify BFB against the complex 
backdrop of a cancer genome, we simulated rearranged chromosomes. Our overall goal 
was to simulate cancer chromosomes that were highly rearranged yet had not undergone 
BFB to see if evidence for BFB appeared in them, suggesting that using such evidence 
would lead to false positives. Conversely, we also wanted to simulate chromosomes whose 
rearrangements included BFB to determine if a proposed BFB signature was sensitive 
enough to identify BFB when it occurred. Since it is not clear how to faithfully simulate 
cancer genome rearrangements, we used a wide range of simulation parameters so we could 
understand how different assumptions affect the features' ability to identify BFB. 

We began with a pair of unrearranged chromosomes and then introduced 50 rear- 
rangements to each. Each rearrangement was an inversion, a deletion, or a duplication. 
Duplications were either direct or inverted and could be tandem or interspersed. The type 
of each rearrangement was chosen from a distribution. In some chromosome pairs, we 
imitated BFB by successively duplicating and inverting segments of one end of one chro- 
mosome for each round of BFB. The number of BFB rounds varied from two to ten. Then, 
we calculated the copy counts and breakpoints for the chromosome pair and introduced 
error to the copy counts according to a random model and also randomly deleted or in- 
serted breakpoint observations. For each combination of rearrangement type distribution 
and number of BFB rounds, we simulated 5,000 chromosome pairs with BFB and 15,000 
without BFB. Complete details are in the SI. 

We first examined the usefulness of count vector distance alone in identifying BFB by 
setting $\lambda = 1$ in our score function (Eqn. 1). For each chromosome pair, we found all
contiguous count vectors of a given length and calculated their scores, as described above
and in the SI. We used the minimum score $s$ over all of these sub-vectors in the chromosome
as a score for the whole chromosome. Then, for varying thresholds, we classified all
chromosomes with a score lower than the threshold as having been rearranged by BFB. 
The performance of this classification varied with the parameters used to simulate the 
chromosomes, but typical results can be seen in Figure 4a. The solid lines show ROC 
curves for different count vector lengths for the simulation with eight rounds of BFB and a 
distribution that yields roughly equal probabilities of the other rearrangement types. Con- 
sistent with previous observations, short count vectors that are perfectly consistent with 
BFB can be found in many chromosomes, even if BFB did not occur. So, even with a score 
threshold of zero, they would still be classified as consistent with BFB. For example, 63% 
of chromosomes without any true BFB rearrangements in Figure 4a had a count vector of 
length six perfectly consistent with BFB. 
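
The per-chromosome scoring described above can be sketched as a sliding window over the copy counts. This is only an illustration: `distance_fn` and `fold_back_fraction_fn` are hypothetical callables standing in for the count vector distance and the fold-back fraction of a window, and the comparison uses `<=` so that a perfect score of zero passes a threshold of zero, which is our reading of the zero-threshold example in the text.

```python
def chromosome_score(counts, window, distance_fn, fold_back_fraction_fn, weight):
    """Minimum Eqn. 1 score over all contiguous sub-vectors of length `window`."""
    if window <= 0 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    best = float("inf")
    for start in range(len(counts) - window + 1):
        sub_vector = counts[start:start + window]
        delta = distance_fn(sub_vector)                    # hypothetical distance
        f = fold_back_fraction_fn(start, start + window)   # hypothetical fraction
        best = min(best, weight * delta + (1 - weight) * (1 - f))
    return best


def classify_as_bfb(score, threshold):
    """Call BFB when the chromosome score does not exceed the threshold."""
    return score <= threshold


# Toy stand-ins, NOT the paper's distance: zero for non-decreasing windows.
toy_distance = lambda v: 0.0 if list(v) == sorted(v) else 0.5
toy_fraction = lambda start, end: 1.0

score = chromosome_score([3, 1, 2, 4], 2, toy_distance, toy_fraction, weight=1.0)
```

Because the minimum is taken over every window, a single short window that happens to look like BFB is enough to give the whole chromosome a low score, which is why short count vectors produce many false positives.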

In contrast, examining longer count vectors produced a better classification. For in- 
stance, setting the score threshold to .10, count vectors of length twelve could achieve a 
true positive rate (TPR) of 67% and a false positive rate (FPR) of only 10%. However, 
this performance must be considered in the context of an experiment seeking evidence for 
BFB. Chromosomes that have undergone BFB are probably rare. If only one in a hundred 
chromosomes tested underwent BFB, then a test with an FPR of even 1% will produce 
mostly false discoveries. Achieving this FPR with count vectors of length twelve with the 
chromosomes in Figure 4a would result in a TPR of only 16%. A more appropriate target 
FPR for screening many samples, say .1%, could not be achieved with count vectors alone.
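
The false discovery argument can be checked with a few lines of arithmetic. The sketch below assumes, as in the text, that one in a hundred tested chromosomes truly underwent BFB; the function name is illustrative.

```python
def false_discovery_rate(prevalence, tpr, fpr):
    """Expected fraction of positive calls that are false."""
    true_positives = prevalence * tpr
    false_positives = (1.0 - prevalence) * fpr
    return false_positives / (true_positives + false_positives)


# Length-12 count vectors at an FPR of 1% (TPR of 16%).
fdr_at_1_percent = false_discovery_rate(prevalence=0.01, tpr=0.16, fpr=0.01)
# Length-12 count vectors at an FPR of 10% (TPR of 67%).
fdr_at_10_percent = false_discovery_rate(prevalence=0.01, tpr=0.67, fpr=0.10)
```

Under these assumptions roughly 86% of calls at the 1% FPR operating point would be false discoveries, and about 94% at the 10% operating point.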

Next, we incorporated fold-back inversions into the scoring function. We set $\lambda = 0.5$,
giving equal weight to fold-back fraction and count vector distance. ROC curves using 
this approach are shown by dashed lines in Figure 4a. Incorporating fold-back fractions 
into the scoring leads to better discrimination of chromosomes with and without BFB 
rearrangements; the test in Figure 4a that combines count vectors of length 12 and fold- 
back inversions can achieve a TPR of 48% with an FPR of .1% by setting the score threshold 
to .27. This suggests that it could detect BFB in a large dataset without being overwhelmed 
by false discoveries. 

Of course, these conclusions depend on our simulation resembling actual cancer rear- 
rangements and BFB cycles. A true specification of cancer genome evolution is unknown 
and in any case varies from cancer to cancer. Recognizing this complication, we repeated 
the analysis in Figure 4a for the different rearrangement distributions, number of BFB 
rounds, and count vector lengths. For each combination, we recorded the score threshold 
needed to achieve FPRs of .1%, 1%, and 5%, and the respective expected TPRs. The full 
results are shown in Dataset S1 and ROC curves are shown in Figures S1-5. Generally,
different simulations showed the same trends. Fold-back inversions alone were better at 
identifying BFB than count vectors alone, but the combination of both features provided 
the best classification. By examining a wide range of simulation parameters, we illus- 
trate how changes in assumptions about cancer genome evolution and BFB influence the 
appropriateness and expected outcomes of tests for BFB. 
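
One simple way to obtain a score threshold for a target FPR from simulated data is sketched below. It is an assumption about the procedure rather than a description of the code used for the paper: the threshold is the largest simulated non-BFB score at which the fraction of non-BFB chromosomes called positive stays within the target.

```python
def threshold_for_fpr(negative_scores, target_fpr):
    """Largest observed score whose empirical FPR does not exceed target_fpr."""
    best = None
    total = len(negative_scores)
    for candidate in sorted(set(negative_scores)):
        flagged = sum(1 for s in negative_scores if s <= candidate)
        if flagged / total <= target_fpr:
            best = candidate
        else:
            break  # the FPR only grows with larger thresholds
    return best  # None when no threshold meets the target


def true_positive_rate(positive_scores, threshold):
    """Fraction of simulated BFB chromosomes scoring at or below the threshold."""
    return sum(1 for s in positive_scores if s <= threshold) / len(positive_scores)


negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
positives = [0.05, 0.15, 0.25, 0.5]
chosen = threshold_for_fpr(negatives, target_fpr=0.2)
tpr = true_positive_rate(positives, chosen)
```

With 15,000 simulated non-BFB chromosome pairs, an FPR of .1% corresponds to allowing at most 15 of them to fall at or below the threshold.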

We applied our method to a publicly available dataset of copy number profiles from 
746 cancer cell lines [13]. We found three chromosomes with count vectors of length 12 
nearly consistent with BFB: chromosome 8 from cell line AU565, chromosome 10 from cell 
line PC-3, and chromosome 8 from cell line MG-63 (see SI). While the patterns of copy 
counts on these chromosomes do bear the hallmarks of BFB, our simulations suggest that 
labeling chromosomes as having undergone BFB based on these count vectors would lead 
to an FPR between 1% and 10%. Given that thousands of chromosomes were examined, 
many of which were highly rearranged, the consistency of these copy counts with BFB may 
be spurious. 

We also applied our method to paired-end sequencing data from seven previously pub- 
lished pancreatic cancer samples [14]. We estimated copy numbers from the reads and 
used breakpoints as reported by the original investigators. We examined count vectors of 
length 8 and chose a threshold score of .18, which would give an FPR of .1% based on sim- 
ulations where the non-BFB rearrangement types are roughly equally likely. We identified 
two chromosomes that showed evidence for BFB, both from the same sample, PD3641. 
The first was the long arm of chromosome 8. This chromosome was identified by the orig- 
inal investigators as likely being rearranged by BFB. Our analysis suggests that, barring 
rearrangements that differ significantly from any of our simulations, this chromosome did 
indeed undergo BFB cycles. We also found evidence for BFB rearrangements from a count 
vector spanning ten megabases on the short arm of chromosome 12 (Figure 4b). Thus, we 
were able to recover evidence for BFB previously identified by hand curation. And by combining
count vector and fold-back analysis, we found an additional strong BFB candidate
that would not be apparent without modeling and simulation.

7 Discussion 

Some 80 years after Barbara McClintock's discovery of the Breakage Fusion Bridge mech- 
anism, it is seeing renewed interest in the context of tumor genome evolution. Recent 
publications have claimed, based on empirical observations of segmentation counts and 
other features, that their data counts are "consistent with BFB". The main technical con- 
tribution of the paper is an efficient algorithm for detecting if given segmentation counts 
can indeed be created by Breakage Fusion Bridge cycles. That algorithm turns out to 
be non-trivial, requiring a deep foray into the combinatorics of BFB count vectors, even 
though its final implementation is straightforward and fast. Experimenting with the implementation
reveals that (a) true BFB cycles create a great diversity of count vectors,
not all of which are easily recognizable as BFB; and (b) at least for
short count vectors, it is often possible to create BFB-like vectors by non-BFB operations.
Thus, being "consistent with BFB" and being "caused by BFB" are not equivalent. Fortunately,
our results also suggest that using longer count vectors together with fold-back
information gives a stronger prediction of BFB, even in the presence of noise and diploidy.
While assembly of these highly rearranged genomes continues to be difficult, recent ad- 
vances in long single-molecule sequencing will provide additional spatial information that 
will improve the resolving power of our algorithm. As more cancer genomes are sequenced, 
including single-cell sequencing, the method presented here will be helpful in determining 
the extent and scope of BFB cycles in the evolution of the tumor genome. 

8 Materials 

Details on the algorithm, and on the simulation methods are available in the accompanying 
supplemental information (SI). 

8.1 Code availability 

Java and Python code used to analyze chromosomes is available at www.bitbucket.org/mckinsel/bfb 

8.2 Estimating pancreas tumor copy number 

The pancreas tumor data was downloaded from the European Genome-Phenome Archive, 
accession number EGAS00000000064. The data was paired-end reads; each end was 37 
bases long. We aligned the reads with Bowtie [17]. Then we used readDepth [18] for 
segmentation and integer copy number estimation. 
Figure 1: A schematic BFB process.
Figure 2: Layer visualization of a BFB palindrome $\beta = \alpha\bar{\alpha}$, where
$\alpha = ABCD\bar{D}D\bar{D}\bar{C}\bar{B}\bar{A}A\bar{A}AB\bar{B}\bar{A}A\bar{A}ABC$.
A possible BFB sequence that produces $\alpha$ is
$ABCD \to ABCD\bar{D} \to ABCD\bar{D}D\bar{D}\bar{C}\bar{B}\bar{A} \to ABCD\bar{D}D\bar{D}\bar{C}\bar{B}\bar{A}A \to ABCD\bar{D}D\bar{D}\bar{C}\bar{B}\bar{A}A\bar{A}AB \to ABCD\bar{D}D\bar{D}\bar{C}\bar{B}\bar{A}A\bar{A}AB\bar{B}\bar{A}A\bar{A}ABC$.
$n(\alpha) = [9, 5, 3, 4]$, and $n(\beta) = 2n(\alpha)$. Figures (a) to (d) depict layers 1 to 4 of $\beta$, respectively.
In each layer $l$, the $l$-blocks composing the collection $B^l$ are annotated as substrings of the form $\beta_i$.
These collections are: $B^1 = \{2\beta_1, \beta_2, 2\beta_3, 4\beta_4\}$, $B^2 = \{2\beta_5, \beta_6, 2\beta_7\}$, $B^3 = \{2\beta_8, \beta_9\}$, $B^4 = \{4\beta_{10}\}$.
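
The BFB sequence in the Figure 2 caption can be replayed with a short sketch. It assumes a string encoding that is ours, not the paper's: uppercase letters are forward segments and lowercase letters are inverted segments, and one BFB cycle appends the inverted copy of a suffix. The suffix lengths (1, 5, 1, 3, 7) are inferred from the lengths of the successive strings in the caption.

```python
from collections import Counter


def invert(segments):
    """Reverse a segment string and flip the orientation (case) of each segment."""
    return segments[::-1].swapcase()


def bfb_cycle(chromosome, suffix_length):
    """Append the inverted copy of the last `suffix_length` segments."""
    if not 0 < suffix_length <= len(chromosome):
        raise ValueError("suffix_length must be between 1 and len(chromosome)")
    return chromosome + invert(chromosome[-suffix_length:])


def count_vector(chromosome, alphabet):
    """Copy count of each segment, ignoring orientation."""
    counts = Counter(chromosome.upper())
    return [counts[segment] for segment in alphabet]


chromosome = "ABCD"
for suffix_length in (1, 5, 1, 3, 7):
    chromosome = bfb_cycle(chromosome, suffix_length)

alpha_counts = count_vector(chromosome, "ABCD")
palindrome_counts = count_vector(chromosome + invert(chromosome), "ABCD")
```

Ignoring orientation, the final string spells out the $\alpha$ of the caption, and its count vector and that of the palindrome match the values stated there.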
Algorithm: SEARCH-BFB($n$)
Input: A count vector $n = [n_1, n_2, \ldots, n_k]$.

Output: A BFB string $\alpha$ such that $n(\alpha) = n$, or "FAILED" if there is no such $\alpha$.

1 Set $B^{k+1} \leftarrow \emptyset$.

2 For $l \leftarrow k$ down to 1 do

3     Apply FOLD($B^{l+1}$, $n_l$). If this operation has failed, return "FAILED".

4     Otherwise, let $B'$ be the output of FOLD($B^{l+1}$, $n_l$), and set $B^l$ to be the
      wrapping of $B'$.

5 Apply FOLD($B^1$, 1). If this operation has failed, return "FAILED".
6 Otherwise, for $\alpha\bar{\alpha}$ the single palindrome in the output collection of FOLD($B^1$, 1),
  return $\alpha$.

Procedure: FOLD($B$, $n$)

Input: An $l$-BFB palindrome collection $B$ and an integer $n > 0$.

Output: A folding $B'$ of $B$ such that $|B'| = n$, or the string "FAILED" if there is no
such $B'$.

1 The implementation of the FOLD procedure is found in the SI document.

Figure 3: An algorithm for the BFB count vector problem 
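
For very small inputs, the count vector problem can also be decided by exhaustive search, which is useful as a cross-check. The sketch below is not SEARCH-BFB: it is an exponential-time brute force, under the assumptions that every segment count is positive, that the starting chromosome contains each segment once, and that a BFB cycle appends the inverted copy of a suffix (uppercase for forward, lowercase for inverted segments).

```python
def invert(segments):
    """Reverse a segment string and flip each orientation (case)."""
    return segments[::-1].swapcase()


def is_bfb_count_vector(target):
    """Brute force: can suffix fold-backs of 'AB...' reach exactly these counts?"""
    k = len(target)
    if k == 0 or k > 26 or min(target) < 1:
        raise ValueError("expects 1 to 26 positive segment counts")
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:k]
    goal = tuple(target)

    def counts(chromosome):
        upper = chromosome.upper()
        return tuple(upper.count(symbol) for symbol in alphabet)

    seen = {alphabet}
    stack = [alphabet]
    while stack:
        chromosome = stack.pop()
        if counts(chromosome) == goal:
            return True
        for suffix_length in range(1, len(chromosome) + 1):
            longer = chromosome + invert(chromosome[-suffix_length:])
            # Counts never decrease, so overshooting strings can be pruned.
            within = all(c <= g for c, g in zip(counts(longer), goal))
            if within and longer not in seen:
                seen.add(longer)
                stack.append(longer)
    return False
```

For example, `[1, 2]` is reachable (`AB` to `ABb`), while `[2, 1]` is not, because every suffix of `AB` contains segment B and so any cycle raises its count above one.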




Figure 4: Simulation and pancreatic cancer results. a) ROC curves for different count
vector lengths with and without fold-back fractions for the simulation of eight BFB rounds
and equally likely other rearrangements. b) Observed copy counts and copy counts compatible
with BFB on the short arm of chromosome 12 in pancreatic cancer sample PD3641.
The presence of fold-back inversions and the count vector's consistency with BFB suggests 
that this portion of chromosome 12 underwent BFB cycles. 
1. Properties of BFB Strings

In this section, we prove Claim 1 from the main manuscript. To do so, we first formulate several 
auxiliary claims. 

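Observation 1. If $\alpha \xrightarrow{BFB} \beta$, $\beta = \beta'\beta''$, and $\beta'' \xrightarrow{BFB} \beta''\gamma$, then $\alpha \xrightarrow{BFB} \beta\gamma$.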

Call a string $\alpha$ an $l$-$t$-string if for the count vector $n(\alpha) = [n_1, n_2, \ldots, n_k]$, $n_r > 0$ if and only if
$l \le r \le t$. Thus, an $l$-$t$-BFB string is an $l$-BFB string $\alpha$ such that top($\alpha$) $= t$. Denote by $\sigma_{l,t}$ the
consecutive genomic region $\sigma_{l,t} = \sigma_l\sigma_{l+1}\ldots\sigma_t$ (when $t < l$, $\sigma_{l,t} = \varepsilon$), and observe that $l$-$t$-BFB strings
always start with the prefix $\sigma_{l,t}$.

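Claim 3. Let $l', l, t$ be integers and $\alpha\beta$ an $l'$-BFB string such that $\beta$ is an $l$-$t$-string. Then,

(1) If $\beta$ starts with the prefix $\sigma_{l,t}$, then $\sigma_{l,t} \xrightarrow{BFB} \beta$ (i.e. $\beta$ is an $l$-$t$-BFB string).

(2) If $\beta$ ends with the suffix $\bar{\sigma}_{l,t}$, then $\sigma_{l,t} \xrightarrow{BFB} \bar{\beta}$.

(3) If $\beta$ starts with the prefix $\bar{\sigma}_{l,t}$, then $\bar{\sigma}_{l,t} \xrightarrow{BFB} \beta$.

(4) If $\beta$ ends with the suffix $\sigma_{l,t}$, then $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\beta}$.

Proof. When $t < l$, $\sigma_{l,t} = \beta = \varepsilon$, and all four items in the claim are sustained in a straightforward
manner. Similarly, when $\alpha\beta = \sigma_{l',t}$, then $\beta = \sigma_{l,t}$ and again all four items in the claim are sustained.
Otherwise, $t \ge l$ and there are some $\rho, \gamma$ such that $\gamma \ne \varepsilon$, $\rho\gamma$ is an $l'$-BFB string, and $\alpha\beta = \rho\gamma\bar{\gamma}$. In
particular, $\rho\gamma$ is a proper prefix of $\alpha\beta$. Assume by induction that the claim is sustained with respect
to all proper prefixes of $\alpha\beta$ (from Lemma 2 in [1], all such prefixes are $l'$-BFB strings). Note that $\beta$, $\bar{\gamma}$,
and $\gamma\bar{\gamma}$ are all suffixes of $\alpha\beta = \rho\gamma\bar{\gamma}$. Consider three cases: 1. $\beta$ is a proper suffix of $\bar{\gamma}$, 2. $\beta$ is a proper
suffix of $\gamma\bar{\gamma}$ and $\bar{\gamma}$ is a suffix of $\beta$, and 3. $\gamma\bar{\gamma}$ is a suffix of $\beta$.

1. $\beta$ is a proper suffix of $\bar{\gamma}$. In this case, $\bar{\gamma} = \gamma'\beta$ for some string $\gamma' \ne \varepsilon$, therefore $\rho\gamma\bar{\gamma} = \rho\bar{\beta}\bar{\gamma}'\gamma'\beta$.
From the inductive assumption and the fact that $\rho\bar{\beta}$ is a proper prefix of $\rho\bar{\beta}\bar{\gamma}' = \rho\gamma$ (which is in turn a
proper prefix of $\alpha\beta$), $\rho\bar{\beta}$ sustains the claim. Therefore,

(1) If $\beta$ starts with the prefix $\sigma_{l,t}$, then $\bar{\beta}$ ends with the suffix $\bar{\sigma}_{l,t}$, therefore $\sigma_{l,t} \xrightarrow{BFB} \beta$.

(2) If $\beta$ ends with the suffix $\bar{\sigma}_{l,t}$, then $\bar{\beta}$ starts with the prefix $\sigma_{l,t}$, therefore $\sigma_{l,t} \xrightarrow{BFB} \bar{\beta}$.

(3) If $\beta$ starts with the prefix $\bar{\sigma}_{l,t}$, then $\bar{\beta}$ ends with the suffix $\sigma_{l,t}$, therefore $\bar{\sigma}_{l,t} \xrightarrow{BFB} \beta$.

(4) If $\beta$ ends with the suffix $\sigma_{l,t}$, then $\bar{\beta}$ starts with the prefix $\bar{\sigma}_{l,t}$, therefore $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\beta}$.

2. $\beta$ is a proper suffix of $\gamma\bar{\gamma}$ and $\bar{\gamma}$ is a suffix of $\beta$. In this case, there are some $\gamma_1$ and $\gamma_2$ such that
$\gamma_1 \ne \varepsilon$, $\gamma = \gamma_1\gamma_2$, $\gamma\bar{\gamma} = \gamma_1\gamma_2\bar{\gamma}_2\bar{\gamma}_1$ and $\beta = \gamma_2\bar{\gamma}_2\bar{\gamma}_1$. Thus, $\alpha\beta = \rho\gamma\bar{\gamma} = \rho\gamma_1\gamma_2\bar{\gamma}_2\bar{\gamma}_1 = \rho\bar{\beta}\bar{\gamma}_1$. Here also,
we get that $\rho\bar{\beta}$ is a proper prefix of $\alpha\beta$, and similarly as in the previous case the inductive assumption
implies the correctness of the claim.

3. $\gamma\bar{\gamma}$ is a suffix of $\beta$. In this case, there is some $\gamma'$ such that $\beta = \gamma'\gamma\bar{\gamma}$, and therefore $\alpha\beta = \alpha\gamma'\gamma\bar{\gamma}$. To
show items (1) and (3) in the claim, assume that $\beta$ starts with the prefix $\phi$ such that either $\phi = \sigma_{l,t}$ or
$\phi = \bar{\sigma}_{l,t}$, respectively. It must be that $\phi$ is a prefix of $\gamma'\gamma$, since the first character of $\bar{\gamma}$ is the reverse of
the last character of $\gamma$, and thus cannot be included in $\phi$. Therefore, from the inductive assumption and
the fact that $\alpha\gamma'\gamma$ is a proper prefix of $\alpha\gamma'\gamma\bar{\gamma} = \alpha\beta$ (recall that $\gamma \ne \varepsilon$ and therefore $\bar{\gamma} \ne \varepsilon$), $\phi \xrightarrow{BFB} \gamma'\gamma$.
By definition, $\phi \xrightarrow{BFB} \gamma'\gamma\bar{\gamma} = \beta$, proving items (1) and (3) in the claim.

To show items (2) and (4) in the claim, assume that $\beta$ ends with the suffix $\phi$ such that either $\phi = \bar{\sigma}_{l,t}$
or $\phi = \sigma_{l,t}$, respectively. Similarly as above, it must be that $\phi$ is a suffix of $\bar{\gamma}$. Note that case (2) of this
proof implies that $\bar{\phi} \xrightarrow{BFB} \gamma$, and by definition $\bar{\phi} \xrightarrow{BFB} \gamma\bar{\gamma}$. In particular, $\bar{\phi}$ is a prefix of $\gamma$, and therefore
the string $\alpha\gamma'\bar{\phi}$ is a proper prefix of $\alpha\beta = \alpha\gamma'\gamma\bar{\gamma}$, and $\bar{\phi}$ is the suffix of the suffix $\gamma'\bar{\phi}$ of $\alpha\gamma'\bar{\phi}$. From the
inductive assumption, $\phi \xrightarrow{BFB} \phi\bar{\gamma}'$. Thus, from Observation 1, and the fact that $\phi$ is a suffix of $\bar{\gamma}$, we get
that $\bar{\phi} \xrightarrow{BFB} \gamma\bar{\gamma} \xrightarrow{BFB} \gamma\bar{\gamma}\bar{\gamma}' = \bar{\beta}$, and items (2) and (4) in the claim follow. □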

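Claim 4. Let $\alpha$ be a BFB string, and let $\bar{\sigma}\beta\sigma$ be a substring of $\alpha$ such that $\beta$ contains no occurrences
of $\sigma$ or $\bar{\sigma}$. Then, $\beta$ is a palindrome.

Proof. From Lemma 2 in [1], every prefix of $\alpha$ is a BFB string, and thus we may assume without loss
of generality that $\bar{\sigma}\beta\sigma$ is a suffix of $\alpha$. We prove the claim by induction over the length of $\alpha$. Note that
for getting a substring of the form $\bar{\sigma}\beta\sigma$, $\alpha$ must be of the form $\alpha = \rho\gamma\bar{\gamma}$, where $\gamma \ne \varepsilon$ (since strings of
the form $\sigma_{l,t}$ cannot contain both characters $\sigma$ and $\bar{\sigma}$). If $\bar{\gamma}$ is a suffix of $\bar{\sigma}\beta\sigma$, then $\bar{\gamma}$ ends with $\sigma$, and
does not contain any additional occurrences of $\sigma$ or $\bar{\sigma}$. Therefore, $\gamma$ starts with $\bar{\sigma}$, and it must be that
$\bar{\sigma}\beta\sigma = \gamma\bar{\gamma}$, and in particular $\beta$ is a palindrome. Else, $\bar{\sigma}\beta\sigma$ is a suffix of $\bar{\gamma}$, therefore $\bar{\sigma}\bar{\beta}\sigma$ is a prefix of
$\gamma$. In particular, the prefix $\rho\bar{\sigma}\bar{\beta}\sigma$ of $\rho\gamma$ is a proper prefix of $\alpha$ (since $\gamma \ne \varepsilon$). Since this prefix is a BFB string
(Lemma 2 in [1]), the inductive assumption implies that $\bar{\beta}$, and therefore $\beta$, is a palindrome. □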

Claim 5. Let $\alpha$ be a BFB string and $\gamma$ a palindromic concatenation of $l$-blocks, such that $\alpha$ contains
$\bar{\sigma}_{l,t}\gamma\sigma_{l,t}$ as a substring and top($\gamma$) $= t' \le t$. Then, $\gamma$ is a convexed $l$-palindrome.

Proof. By induction on the number of $l$-blocks composing $\gamma$. If $\gamma$ is composed of zero $l$-blocks, then $\gamma = \varepsilon$,
which is a convexed $l$-palindrome by definition. Otherwise, $\gamma$ is of the form $\gamma = \beta_1\beta_2\cdots\beta_q\beta_{q+1}\beta_q\cdots\beta_2\beta_1$,
where $\beta_i$ is an $l$-block for every $1 \le i \le q$, and $\beta_{q+1}$ is an $l$-block in case $\gamma$ is composed of an odd number
$2q+1$ of blocks and $\beta_{q+1} = \varepsilon$ in case $\gamma$ is composed of an even number $2q$ of blocks. Let $i$ be the minimum
index such that top($\beta_i$) $= t'$. Observe that $\gamma = \gamma'\gamma''\gamma'$, where $\gamma' = \beta_1\cdots\beta_{i-1}$ is a concatenation of
$l$-blocks such that top($\gamma'$) $< t'$ (from the selection of $i$), and $\gamma'' = \beta_i\cdots\beta_q\beta_{q+1}\beta_q\cdots\beta_i$ is a palindromic
concatenation of $l$-blocks with top($\gamma''$) $= t'$. Since $\gamma''$ is a substring of $\alpha$, it is the suffix of some prefix
$\alpha'$ of $\alpha$. From Lemma 2 in [1], $\alpha'$ is a BFB string. From the fact that $\gamma''$ starts with $\sigma_{l,t'}$ (as $\sigma_{l,t'}$ is a
prefix of the $l$-$t'$-block $\beta_i$), we get from Claim 3 that $\gamma''$ is an $l$-BFB string, and in particular it is an
$l$-BFB palindrome. In addition, observe that $\alpha$ contains $\bar{\sigma}_{l,t'}\gamma'\sigma_{l,t'} = \bar{\sigma}_{t'}\bar{\sigma}_{l,t'-1}\gamma'\sigma_{l,t'-1}\sigma_{t'}$ as a substring.
Since $\bar{\sigma}_{l,t'-1}\gamma'\sigma_{l,t'-1}$ does not contain occurrences of $\sigma_{t'}$ or $\bar{\sigma}_{t'}$, from Claim 4, $\bar{\sigma}_{l,t'-1}\gamma'\sigma_{l,t'-1}$, and in
particular $\gamma'$, is a palindrome. Thus, from the inductive assumption, $\gamma'$ is a convexed $l$-palindrome, and
by definition $\gamma = \gamma'\gamma''\gamma'$ is a convexed $l$-palindrome. □

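Claim 6. Let $l, t', t$ be integers such that $l, t' \le t$. For every convexed $l$-$t'$-palindrome $\gamma$, $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma\sigma_{l,t}$.

Proof. We prove the claim by induction on $t'$. When $t' < l$, $\gamma = \varepsilon$ is the only convexed $l$-$t'$-palindrome,
and by definition $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\sigma_{l,t}$. Otherwise, $t' \ge l$, and assume by induction the claim holds for every
$l, t'', t$ such that $t'' < t' \le t$. By definition, $\gamma$ is of the form $\gamma'\beta\gamma'$, where $\gamma'$ is a convexed $l$-$t''$-palindrome
such that $t'' < t'$, and $\beta$ is an $l$-$t'$-BFB palindrome. From the inductive assumption, $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma'\sigma_{l,t}$,
and therefore $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma'\sigma_{l,t'}$. As $\beta$ is an $l$-$t'$-BFB palindrome, $\beta$ is of the form $\beta = \alpha\bar{\alpha}$, where $\alpha$ is an
$l$-$t'$-BFB string. In particular, $\sigma_{l,t'} \xrightarrow{BFB} \alpha$, and from Observation 1 and the fact that $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma'\sigma_{l,t'}$,
we get that $\bar{\sigma}_{l,t} \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma'\alpha \xrightarrow{BFB} \bar{\sigma}_{l,t}\gamma'\alpha\bar{\alpha}\gamma'\sigma_{l,t} = \bar{\sigma}_{l,t}\gamma'\beta\gamma'\sigma_{l,t} = \bar{\sigma}_{l,t}\gamma\sigma_{l,t}$. □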

Finally, we turn to prove the correctness of Claim 1 from the main manuscript. 

Claim 1. A string $\alpha$ is an $l$-BFB palindrome if and only if $\alpha = \varepsilon$, $\alpha$ is an $l$-block, or $\alpha = \beta\gamma\beta$, such
that $\beta$ is an $l$-BFB palindrome, $\gamma$ is a convexed $l$-palindrome, and $\gamma \le^t \beta$.

Proof. By definition, if $\alpha = \varepsilon$ or $\alpha$ is an $l$-block, then $\alpha$ is an $l$-BFB palindrome. Thus, it remains to
show that when $\alpha$ is neither $\varepsilon$ nor an $l$-block, $\alpha$ is an $l$-BFB palindrome if and only if $\alpha = \beta\gamma\beta$, such
that $\beta$ is an $l$-BFB palindrome, $\gamma$ is a convexed $l$-palindrome, and $\gamma \le^t \beta$. Let $t$ = top($\alpha$).

Assume that $\alpha$ is an $l$-BFB palindrome which is neither $\varepsilon$ nor an $l$-block. Therefore, $\alpha$ is a concatenation
of at least two $l$-blocks, and so $\alpha$ is of the form $\alpha = \beta\gamma\beta$, such that $\beta$ is an $l$-block and $\gamma$ is some
palindromic concatenation of $l$-blocks. Thus, $\beta$ must start with the prefix $\sigma_{l,t}$ and end with the suffix
$\bar{\sigma}_{l,t}$, and top($\gamma$) $\le t$ = top($\beta$). In addition, observe that $\bar{\sigma}_{l,t}\gamma\sigma_{l,t}$ is a substring of $\alpha$, and from Claim 5,
$\gamma$ is a convexed $l$-palindrome, proving this direction of the claim.

For the other direction, assume that $\alpha = \beta\gamma\beta$, such that $\beta$ is an $l$-BFB palindrome, $\gamma$ is a convexed
$l$-palindrome, and $\gamma \le^t \beta$. Therefore, top($\beta$) $= t$, and top($\gamma$) $= t' \le t$. Since $\beta$ is an $l$-$t$-BFB string,
it starts with the prefix $\sigma_{l,t}$, and being a palindrome it ends with the suffix $\bar{\sigma}_{l,t}$. From Claim 6 and
Observation 1, $\beta\gamma\sigma_{l,t}$ is an $l$-BFB string, and applying again Observation 1, $\beta\gamma\beta = \alpha$ is an $l$-BFB string.
Being a palindrome, $\alpha$ is an $l$-BFB palindrome. □



2. Algorithm SEARCH-BFB 

This section completes the missing details in the description of Algorithm SEARCH-BFB in the main 
manuscript. We describe the FOLD procedure, prove the correctness of the algorithm, and analyze its 
running time. 

2.1. Additional Notation and Collection Arithmetics. In order to give an implementation of
the FOLD procedure, we first add notation and definitions of some new entities, and observe related
properties. For short, from now on we simply say a "collection" when referring to an $l$-BFB palindrome
collection (in some cases we will explicitly indicate that the collection is an $l$-block collection). A
collection containing a single element $\beta$ will be simply denoted by $\beta$, instead of $\{\beta\}$.

For two numbers $t, t'$ and a collection $B$, $B^{[t,t')}$ denotes the sub-collection containing all elements $\beta$
in $B$ such that $t \le$ top($\beta$) $< t'$. Denote $B^{\ge t} = B^{[t,\infty)}$ and $B^{<t} = B - B^{\ge t} = B^{[-\infty,t)}$. For a nonempty
collection $B$, denote $min_t(B) = \min_{\beta \in B}\{$top($\beta$)$\}$, where $min_t(\emptyset)$ is defined to be $\infty$. Say that an element
$\beta \in B$ is minimal in $B$ if top($\beta$) $= min_t(B)$. The collection $B = B' \cap B''$ contains all elements appearing
in both $B'$ and $B''$, where the count of each element $\beta \in B$ equals the minimum among the counts
of $\beta$ in $B'$ and $B''$. Say that $B' \subseteq B$ if $B' = B \cap B'$. Notations of the form $\vec{a}$ will denote series
$\vec{a} = a_0, a_1, a_2, \ldots$, and $\vec{a}_d$ denotes the prefix $a_0, a_1, \ldots, a_d$ of $\vec{a}$. For an integer $m > 0$, denote by $d_m$ the
maximum integer $d \ge 0$ such that $m$ is divided by $2^d$. For example, $d_8 = d_{24} = 3$, and $d_7 = 0$. Observe
that $d_m = 0$ when $m$ is odd, and otherwise $d_m = 1 + d_{m/2}$. $d_m$ can also be understood as the index of the
least significant bit different from 0 in the binary representation of $m$, and in particular $d_m \le \log_2 m$.
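
The quantity $d_m$ is the exponent of the largest power of two dividing $m$. The sketch below computes it in two ways, by repeated halving (mirroring the recursion $d_m = 1 + d_{m/2}$) and from the lowest set bit; the function names are ours.

```python
def d(m):
    """Largest d >= 0 such that 2**d divides the positive integer m."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    exponent = 0
    while m % 2 == 0:   # m even: d_m = 1 + d_{m/2}
        m //= 2
        exponent += 1
    return exponent     # m odd: d_m = 0


def d_from_lowest_bit(m):
    """Same value, read off as the index of the least significant set bit."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return (m & -m).bit_length() - 1
```

Both functions agree on all positive integers; for instance they return 3 for 8 and for 24, and 0 for any odd number.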

Observation 2. For two collections $B, B'$,

• mod2($B + B'$) = mod2($B$ + mod2($B'$)) = mod2(mod2($B$) + $B'$)
  = mod2(mod2($B$) + mod2($B'$))
  = mod2($B$) + mod2($B'$) $-$ 2(mod2($B$) $\cap$ mod2($B'$)).



Table S1. The decomposition and signature of the collection $B = \{2\beta_1, \beta_2, 2\beta_3, 4\beta_4\}$
appearing in Fig. 2a. Here, $r(B) = 3$.

$d$    $B_d$                                          $L_d$          $H_d$          $s_d$
0      $\{2\beta_1, \beta_2, 2\beta_3, 4\beta_4\}$    $\beta_2$      $2\beta_1$     1
1      $\{\beta_3, 2\beta_4\}$                        $\beta_3$      $\emptyset$    0
2      $\beta_4$                                      $\beta_4$      $\emptyset$    0
3      $\emptyset$                                    $\emptyset$    $\emptyset$    $-1$

• For an integer $i \ge 0$, mod2($B + iB'$) = mod2($B + B'$) when $i$ is odd, and mod2($B + iB'$) =
mod2($B$) when $i$ is even. In particular, mod2($B - B'$) = mod2($B - B' + 2B'$) = mod2($B + B'$).

• For two integers $t$ and $t'$, mod2($B^{[t,t')}$) = (mod2($B$))$^{[t,t')}$.
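
The identities of Observation 2 can be checked numerically by modelling a collection as a multiset. The sketch below uses Python's `collections.Counter` for this; the element labels and the example collections are arbitrary illustrations.

```python
from collections import Counter


def mod2(collection):
    """Keep one copy of every element that has an odd count."""
    return Counter({element: 1 for element, count in collection.items()
                    if count % 2 == 1})


def scale(collection, factor):
    """Multiply every count by a non-negative integer factor."""
    return Counter({element: count * factor
                    for element, count in collection.items() if count * factor > 0})


B = Counter({"b1": 2, "b2": 1, "b3": 2, "b4": 4})
B_prime = Counter({"b2": 1, "b3": 1, "b5": 3})

left = mod2(B + B_prime)
overlap = mod2(B) & mod2(B_prime)          # multiset intersection
right = (mod2(B) + mod2(B_prime)) - (overlap + overlap)

odd_multiple = mod2(B + scale(B_prime, 3))
even_multiple = mod2(B + scale(B_prime, 2))
```

Here `+`, `-`, and `&` on `Counter` objects are multiset sum, difference, and intersection, which match the collection arithmetic used in this section.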

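Definition 7. A convexed $l$-collection of order $q$ is an $l$-BFB palindrome collection $A$ of the form
$A = \{\alpha_1, 2\alpha_2, 4\alpha_3, \ldots, 2^{q-1}\alpha_q\}$, where $\alpha_q \le^t \alpha_{q-1} \le^t \cdots \le^t \alpha_1$.

A convexed $l$-collection of order $q$, $A = \{\alpha_1, \ldots, 2^{q-1}\alpha_q\}$, satisfies $|A| = 2^q - 1$. In addition, $A = \emptyset$ when
$q = 0$, and when $A \ne \emptyset$, mod2($A$) $= \alpha_1$ and $\frac{A - \alpha_1}{2} = \{\alpha_2, 2\alpha_3, \ldots, 2^{q-2}\alpha_q\}$ is a convexed $l$-collection
of order $q - 1$. It is possible to concatenate all elements in $A$ to produce a convexed $l$-palindrome $\gamma_A$,
where $\gamma_A = \varepsilon$ if $A = \emptyset$, and otherwise $\gamma_A = \gamma_{\frac{A-\alpha_1}{2}}\,\alpha_1\,\gamma_{\frac{A-\alpha_1}{2}}$. In Fig. 2a, all 1-blocks besides the two repeats of
$\beta_1$ form a convexed 1-collection $A = \{\beta_2, 2\beta_3, 4\beta_4\}$ of order 3, where $\gamma_A = \beta_4\beta_3\beta_4\beta_2\beta_4\beta_3\beta_4$.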
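
The recursive construction of the convexed palindrome from a convexed collection can be sketched as follows. The list-based representation is an assumption made for illustration: the input lists the distinct elements in order, with the $i$-th element understood to have multiplicity $2^{i-1}$.

```python
def convexed_palindrome(elements):
    """Concatenate a convexed collection [a_1, ..., a_q] into its palindrome.

    a_i is understood to occur 2**(i-1) times; a_1 sits in the middle and the
    palindrome of the remaining (halved) collection is placed on both sides.
    """
    if not elements:
        return []                       # empty collection gives the empty string
    inner = convexed_palindrome(elements[1:])
    return inner + [elements[0]] + inner


fig_2a = convexed_palindrome(["b2", "b3", "b4"])
```

For the collection of Fig. 2a this reproduces the order of blocks given in the text, and the output length is $2^q - 1$ for an input of $q$ elements.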

Observation 3. For a convexed $l$-collection $A$ and an integer $m$, |mod2($mA$)| $= 0$ if either $m$ is even
or $A = \emptyset$, and otherwise |mod2($mA$)| $= 1$.

Claim 7. Let $A = \{\alpha_1, \ldots, 2^{j-1}\alpha_j, \ldots, 2^{r-1}\alpha_r\}$ and $A' = \{\alpha_1, \ldots, 2^{j-1}\alpha_j\}$ be two convexed $l$-collections
(where $A' \subseteq A$, and it is possible that $A' = \emptyset$). For every number $t$, there is an integer $x \ge 0$ and a
convexed $l$-collection $\hat{A}$ such that $(A - A')^{<t} = 2^x\hat{A}$. In addition, if $A' \ne \emptyset$ then $x > 0$ and $|\hat{A}| < |A|$.

Proof. First, note that $A - A' = \{2^j\alpha_{j+1}, \ldots, 2^{r-1}\alpha_r\}$. Now, let $x = j$ if top($\alpha_{j+1}$) $< t$, and otherwise
let $x$ be the maximum integer in the range $j < x \le r$ such that top($\alpha_x$) $\ge t$. Then, $(A - A')^{<t} =
\{2^x\alpha_{x+1}, \ldots, 2^{r-1}\alpha_r\} = 2^x\{\alpha_{x+1}, \ldots, 2^{r-x-1}\alpha_r\}$. Choosing $\hat{A} = \{\alpha_{x+1}, \ldots, 2^{r-x-1}\alpha_r\}$, the claim follows.

□

Definition 8. Let $B = \{m_1\beta_1, m_2\beta_2, \ldots, m_q\beta_q\}$ be an $l$-BFB palindrome collection. The decomposition
of $B$ is a series triplet $(\vec{B}, \vec{L}, \vec{H})$, whose elements are recursively defined as follows:

• $B_0 = B$, and $B_d = \frac{1}{2}(B_{d-1} - L_{d-1} - H_{d-1})$ for $d > 0$.

• $L_d$ = mod2($B_d$).

• $H_d = (B_d - L_d)^{\ge min_t(L_d)}$.
Denote by $r(B)$ the minimum integer $r$ such that $B_r = \emptyset$.

Table S1 gives the decomposition of the collection $B^1$ corresponding to Fig. 2a in the main manuscript.

In what follows, let $B$ be a collection, $(\vec{B}, \vec{L}, \vec{H})$ its decomposition, and $r = r(B)$.

By definition, $L_d, H_d \subseteq B_d$. It may be observed that the count of each element in $L_d$ is exactly 1
(by definition of the mod2($\cdot$) operation), i.e. mod2($L_d$) $= L_d$, the count of each element in $B_d - L_d$ is
even (since reducing $L_d$ from $B_d$ decreases by 1 the count of each element with an odd count in $B_d$),
therefore the counts of all elements in $H_d$ and in $B_d - L_d - H_d$ are even (since nonzero counts in these
collections equal the corresponding even counts in $B_d - L_d$), i.e. mod2($B_d - L_d$) = mod2($H_d$) =
mod2($B_d - L_d - H_d$) $= \emptyset$. In addition, every single occurrence of an element $\beta \in B_d$ (and in particular
every $\beta \in H_d$ or $\beta \in L_d$) corresponds to $2^d$ repeats of $\beta$ in $B$.

Definition 9. For a collection $B$, $\vec{t}(B) = \vec{t}$ is the non-increasing series of numbers whose elements are
given by $t_0 = \infty$, and $t_d = \min(min_t(L_{d-1}), t_{d-1})$ for $d > 0$.

The following observation may be easily asserted, in an inductive manner.

Observation 4. For a collection $B$ and every integer $d \ge 0$, $H_d = (B_d - L_d)^{\ge t_{d+1}}$, $B^{<t_d} = 2^dB_d$, and
$B^{[t_{d+1}, t_d)} = 2^d(L_d + H_d)$.

Finally, we define the signature of a collection, which is derived from its decomposition and will serve
as an optimality measure implying the folding restrictions over the collection.

Definition 10. The signature of $B$ is a series $\vec{s} = \vec{s}(B)$, where $s_0 = |L_0|$, and
$s_d = |L_d| - |L_{d-1}| - \frac{|H_{d-1}|}{2} + \max(s_{d-1}, 0)$ for $d > 0$.
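
The decomposition and signature can be computed mechanically for the collection of Fig. 2a. The sketch below follows Definitions 8 and 10 as read here, with collections modelled as `Counter` multisets. The top values assigned to the four blocks (4, 3, 2, 1) are an assumption inferred from the figure, and all function names are ours.

```python
import math
from collections import Counter


def mod2(collection):
    """Keep one copy of every element that has an odd count."""
    return Counter({e: 1 for e, c in collection.items() if c % 2 == 1})


def size(collection):
    """Number of elements, counted with multiplicity."""
    return sum(collection.values())


def decompose(collection, top):
    """Return the lists B_d, L_d, H_d for d = 0, ..., r(B)."""
    Bs, Ls, Hs = [], [], []
    current = Counter(collection)
    while True:
        L = mod2(current)
        rest = current - L
        floor = min((top[e] for e in L), default=math.inf)  # min_t of L_d
        H = Counter({e: c for e, c in rest.items() if top[e] >= floor})
        Bs.append(current)
        Ls.append(L)
        Hs.append(H)
        if not current:          # B_r is empty, so r(B) = len(Bs) - 1
            return Bs, Ls, Hs
        current = Counter({e: c // 2 for e, c in (rest - H).items()})


def signature(Ls, Hs):
    """s_0 = |L_0|; s_d = |L_d| - |L_{d-1}| - |H_{d-1}|/2 + max(s_{d-1}, 0)."""
    s = [size(Ls[0])]
    for d in range(1, len(Ls)):
        s.append(size(Ls[d]) - size(Ls[d - 1]) - size(Hs[d - 1]) // 2 + max(s[d - 1], 0))
    return s


B = Counter({"b1": 2, "b2": 1, "b3": 2, "b4": 4})
top = {"b1": 4, "b2": 3, "b3": 2, "b4": 1}   # assumed tops, inferred from Fig. 2a
Bs, Ls, Hs = decompose(B, top)
s = signature(Ls, Hs)
```

Under these assumptions the decomposition terminates with an empty collection at $d = 3$, in agreement with $r(B) = 3$ in the caption of Table S1, and the last signature entry is $-1$.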

The last column of Table S1 shows the signature of the exemplified collection. For two signatures
$\vec{s} = s_0, s_1, \ldots$ and $\vec{s}' = s'_0, s'_1, \ldots$, denote $\vec{s} < \vec{s}'$ if $\vec{s}$ precedes $\vec{s}'$ lexicographically, i.e. there is some
integer $d \ge 0$ such that $s_i = s'_i$ for every $0 \le i < d$, and $s_d < s'_d$. Denote $\vec{s} \le \vec{s}'$ if $\vec{s} < \vec{s}'$ or $\vec{s} = \vec{s}'$.
We will show that signatures can serve as an optimality measure for collections, where lower signature
collections are always less restricted than higher signature collections with respect to folding possibilities.
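
The lexicographic order on signatures is straightforward to implement. The sketch below assumes signatures are stored as finite lists and pads the shorter one with zeros, which relies on the fact, shown later in this section, that signature entries are zero beyond index $r(B)$; the function name is ours.

```python
from itertools import zip_longest


def compare_signatures(s, s_prime):
    """Return -1, 0, or 1 as s precedes, equals, or follows s_prime."""
    for a, b in zip_longest(s, s_prime, fillvalue=0):
        if a != b:
            return -1 if a < b else 1
    return 0


order = compare_signatures([1, 0, 0, -1], [1, 0, 1])
```

The comparison is decided at the first index where the two series differ, so here the first signature precedes the second because they differ first at index 2.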

From now on, when discussing derived entities such as decompositions $(\vec{B}, \vec{L}, \vec{H})$, signatures $\vec{s}$, etc.,
we assume these entities correspond to the collection $B$ discussed in the same context without stating
so explicitly. When several collections are considered, these collections are annotated with superscripts
(e.g. $B', B^*, B^3$, etc.), which also annotate their correspondingly derived entities (e.g. $L'_d, \vec{s}^3$, etc.).

Claim 8. For every $d \ge 0$, $|L_d| \ge \max(s_d, 0)$ and $|B_d| + |L_d| - s_d \ge \max(s_d, 0)$.

Proof. We first show the first inequality in the claim. For $d = 0$, $s_0 = |L_0| \ge 0$ by definition. Assume
by induction $|L_{d'}| \ge \max(s_{d'}, 0)$ for every $0 \le d' < d$. Then,
$|L_d| = s_d + |L_{d-1}| + \frac{|H_{d-1}|}{2} - \max(s_{d-1}, 0) \ge s_d + \frac{|H_{d-1}|}{2} \ge s_d$.
In addition, $|L_d| \ge 0$, and so $|L_d| \ge \max(s_d, 0)$. The second inequality follows
immediately from the first one, as $|B_d| + |L_d| - s_d \ge |B_d| \ge |L_d| \ge \max(s_d, 0)$. □

Claim 9. For $r = r(B)$, $s_r = -\frac{|B_{r-1}| + |L_{r-1}|}{2} + \max(s_{r-1}, 0) \le 0$, and $s_d = 0$ for every $d > r$.

Proof. The inequality $s_r \le 0$ follows immediately from Claim 8 and the fact that $|L_r| = 0$. In addition,
since $B_r = \frac{1}{2}(B_{r-1} - L_{r-1} - H_{r-1}) = \emptyset$, we have that $H_{r-1} = B_{r-1} - L_{r-1}$, and therefore
$s_r = |L_r| - |L_{r-1}| - \frac{|H_{r-1}|}{2} + \max(s_{r-1}, 0) = -\frac{|B_{r-1}| + |L_{r-1}|}{2} + \max(s_{r-1}, 0)$.

To show the second part of the claim, note that $|B_d| = |L_d| = |H_d| = 0$ for every $d \ge r$. This implies
that for every $d > r$, $s_d = |L_d| - |L_{d-1}| - \frac{|H_{d-1}|}{2} + \max(s_{d-1}, 0) = \max(s_{d-1}, 0)$. Since we showed that
$s_r \le 0$, we have that $s_{r+1} = \max(s_r, 0) = 0$, and inductively it follows that $s_d = 0$ for every
$d > r$. □

Define the series $\vec{\Delta} = \vec{\Delta}(B)$, where $\Delta_0 = 0$, and $\Delta_d = \Delta_{d-1} + 2^{d-1}\,$abs($s_{d-1}$) for $d > 0$ (where
abs($s_{d-1}$) is the absolute value of $s_{d-1}$).

Claim 10. For every integer $d \ge 0$,

$|B| = 2^d(|B_d| + |L_d| - s_d) + \Delta_d \ge 2^d\max(s_d, 0) + \Delta_d$.

Proof. The fact that $2^d(|B_d| + |L_d| - s_d) + \Delta_d \ge 2^d\max(s_d, 0) + \Delta_d$ follows from Claim 8. The equality
$|B| = 2^d(|B_d| + |L_d| - s_d) + \Delta_d$ is proven by induction on $d$. For $d = 0$, since $B_0 = B$, $s_0 = |L_0|$, and
$\Delta_0 = 0$ by definition, we get that $2^0(|B_0| + |L_0| - s_0) + \Delta_0 = |B|$. Now, assuming the claim holds for
some $d \ge 0$, we show it also holds for $d' = d + 1$:

$|B| = 2^d(|B_d| + |L_d| - s_d) + \Delta_d$

$= 2^d(|B_d| + |L_d| - s_d - \mathrm{abs}(s_d)) + \Delta_{d+1}$

$= 2^d(|B_d| + |L_d| - 2\max(s_d, 0)) + \Delta_{d+1}$

$= 2^d((2|B_{d+1}| + |L_d| + |H_d|) + |L_d| - 2\max(s_d, 0)) + \Delta_{d+1}$

$= 2^d(2|B_{d+1}| + 2|L_{d+1}| - 2s_{d+1}) + \Delta_{d+1}$

$= 2^{d+1}(|B_{d+1}| + |L_{d+1}| - s_{d+1}) + \Delta_{d+1}$.

□

Conclusion 1. For every $d \ge r = r(B)$ we have that $|B_d| = |L_d| = 0$, where from Claim 9 $s_d = 0$ for
$d > r$. Thus, Claim 10 implies that $|B| = -2^rs_r + \Delta_r = \Delta_d$ for every $d > r$.

Claim 11. Let $B$ and $B'$ be two collections such that $|B| = |B'| = n$ and $\vec{s}_{r-1} = \vec{s}'_{r-1}$ for $r = r(B)$.
Then, $\vec{s} \le \vec{s}'$.

Proof. Since $\vec{s}_{r-1} = \vec{s}'_{r-1}$, it follows that $\Delta_r = \Delta'_r$. From Conclusion 1, $s_r = \frac{\Delta_r - n}{2^r}$. From Claim 10,
$s'_r = \frac{\Delta_r - n}{2^r} + |B'_r| + |L'_r| \ge \frac{\Delta_r - n}{2^r} = s_r$. If $s'_r > s_r$, then $\vec{s} < \vec{s}'$ and the claim holds. If $s'_r = s_r$, then in
particular $|B'_r| = 0$ and so $B'_r = \emptyset$. Thus, $r(B') \le r$, and so for every $d > r$ we have that $s'_d = s_d = 0$,
and $\vec{s}' = \vec{s}$. □

2.2. Folding Increases Signature. This section is dedicated to proving the following claim:

Claim 12. Let $B'$ be a folding of an $l$-block collection $B$. If $B \ne B'$, then $\vec{s} < \vec{s}'$.

The proof, given at the end of this section, is based on an observation that shows how to present a 
general folding as a series of a special kind of elementary foldings, and showing that such elementary 
foldings always increase the signature of the collection. 

Definition 11. Let B' be a folding of B. 

• Say that B' is a type I elementary folding if B' is of the form B' = B — m(2f3 + A) + ma, where 
f3 is an l-block, is a convexed l-collection, m > is an integer such that m(2/3 + i)CB, 
and a = is an l-BFB palindrome such that a ^ B. 

• Say that B' is a type II elementary folding if e ^ B and B' is of the form B' = B + me. 

Claim 13. Let $B$ be a collection of $l$-blocks. For every folding $B'$ of $B$ there is a sequence of collections $B^0, B^1, \ldots, B^j$, where $B^0 = B'$, $B^j = B$, and for every $0 \le i < j$, $B^i$ is a (type I or II) elementary folding of $B^{i+1}$.

Proof. By definition, each element in a folding $B'$ of $B$ is either an $l$-block from $B$, a concatenation of several $l$-blocks from $B$, or $\varepsilon$. The sequence $B^0, B^1, \ldots, B^j$ is built iteratively as follows.

Initiate $B^0 = B'$ and $i = 0$. As long as $B^i \ne B$, we show how to compute $B^{i+1}$ given the collection $B^i$. The construction maintains the property that each computed collection $B^i$ is a folding of $B$. In the case where $B^i$ contains some composite $l$-BFB palindrome of the form $\alpha = \beta\gamma_A\beta$, let $m$ be the count of $\alpha$ in $B^i$, and set $B^{i+1} = B^i - m\alpha + m(2\beta + A)$. We may assume $A \ne \emptyset$, since when $\gamma_A = \varepsilon$ we can choose $A = \varepsilon$. Observe that $B^{i+1}$ is a folding of $B$ (where the same sub-collection of $l$-blocks from $B$ which composes the $m$ copies of $\alpha$ in $B^i$ composes the elements in the $m$ repeats of $2\beta + A$ in $B^{i+1}$), and that $B^i$ is a type I elementary folding of $B^{i+1}$. In addition, since the number of $l$-blocks composing each element in $A$ is less than the number of $l$-blocks composing $\alpha$, after a finite number of such modifications there will be no more composite palindromes in the collection.

In the case where $B^i$ contains no composite palindrome, $B^i$ is a folding of $B$ containing only $l$-blocks and zero or more $\varepsilon$ elements. If $B^i$ contains no $\varepsilon$ elements, then $B^i = B$, and the process is completed choosing $j = i$. Else, for $m$ the count of $\varepsilon$ in $B^i$, set $B^{i+1} = B^i - m\varepsilon$, and therefore $B^i$ is a type II elementary folding of $B^{i+1}$. Note that $B^{i+1} = B$, completing the process for $j = i + 1$. □
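The iterative construction in the proof of Claim 13 can be sketched in code. The representation below is an assumption made only for illustration: an $l$-block is a non-empty string, $\varepsilon$ is the empty string, and a composite palindrome $\beta\gamma_A\beta$ is a hypothetical pair `(beta, A)` where `A` is a tuple of `(element, count)` entries. The sketch does not check that `A` is convexed.

```python
from collections import Counter

EPS = ""  # assumed encoding of the empty palindrome


def is_composite(elem) -> bool:
    """Composite palindromes are encoded as (beta, A) tuples (an assumption)."""
    return isinstance(elem, tuple)


def unfold(folded: Counter) -> list:
    """Return B^0 = folded, B^1, ..., B^j, undoing one elementary folding per step."""
    seq = [Counter(folded)]
    cur = seq[0]
    while True:
        comp = next((x for x in cur if is_composite(x)), None)
        if comp is not None:
            # Undo a type I folding: m*alpha becomes m*(2*beta + A).
            m = cur[comp]
            beta, parts = comp
            nxt = Counter(cur)
            del nxt[comp]
            nxt[beta] += 2 * m
            for elem, count in parts:
                nxt[elem] += count * m
        elif cur[EPS] > 0:
            # Undo a type II folding: drop every copy of epsilon.
            nxt = Counter(cur)
            del nxt[EPS]
        else:
            return seq
        seq.append(nxt)
        cur = nxt


inner = ("y", (("z", 1),))        # stands for y.gamma_{z}.y
outer = ("x", ((inner, 1),))      # stands for x.gamma_{inner}.x
steps = unfold(Counter({outer: 2, "w": 1, EPS: 3}))
```

Each step removes every copy of one composite palindrome (or every copy of $\varepsilon$), so the removed element is absent from the next collection, as the definition of an elementary folding requires.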

Observe that the signature of a collection depends only on its decomposition, and is independent of the manner in which the collection was obtained. Therefore, from the above claim, in order to show that foldings necessarily increase signatures, it is enough to show that each elementary folding increases the signature. In what follows, we give several technical claims that together prove this property.

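Claim 14. Let $B$, $B'$, and $A$ be $l$-BFB palindrome collections and $m > 0$ an integer such that $B' = B + mA$. Then,

(1) For every $0 \le i \le d_m$, $B'_i = B_i + \frac{m}{2^i} A^{<t_i}$.

(2) For every $0 \le i < d_m$, $L'_i = L_i$ (i.e. the series $L'$ and $L$ coincide up to index $d_m - 1$).

(3) For every $0 \le i < d_m$, $H'_i = H_i + \frac{m}{2^i} A^{[t_{i+1}, t_i)}$.

Proof. $B_0 = B$ and $t_0 = \infty$, therefore $B'_0 = B' = B + mA = B_0 + \frac{m}{2^0} A^{<t_0}$. Thus, the first item in the claim holds for $i = 0$, and the two other items hold trivially for every $0 \le i < 0$.

Assuming by induction that for some $i < d_m$ the first item holds for every $0 \le i' \le i$ and the two other items hold for every $0 \le i' < i$, we show that (1) $B'_{i+1} = B_{i+1} + \frac{m}{2^{i+1}} A^{<t_{i+1}}$, (2) $L'_i = L_i$, and (3) $H'_i = H_i + \frac{m}{2^i} A^{[t_{i+1}, t_i)}$.

We start by showing (2). Since $i < d_m$, $\frac{m}{2^i}$ is even, and so
\[ L'_i = \mathrm{mod2}(B'_i) = \mathrm{mod2}\left(B_i + \tfrac{m}{2^i} A^{<t_i}\right) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_i) = L_i. \]
In order to show (3), note that $L'_i = L_i$ implies that $t'_{i+1} = t_{i+1}$, therefore
\[ H'_i \overset{\text{Obs. 4}}{=} (B'_i - L'_i)^{\ge t'_{i+1}} = \left(B_i + \tfrac{m}{2^i} A^{<t_i} - L_i\right)^{\ge t_{i+1}} = (B_i - L_i)^{\ge t_{i+1}} + \tfrac{m}{2^i} A^{[t_{i+1}, t_i)} \overset{\text{Obs. 4}}{=} H_i + \tfrac{m}{2^i} A^{[t_{i+1}, t_i)}. \]
Finally, (1) is true since
\[ B'_{i+1} = \tfrac{1}{2}(B'_i - L'_i - H'_i) = \tfrac{1}{2}\left(B_i + \tfrac{m}{2^i} A^{<t_i} - L_i - H_i - \tfrac{m}{2^i} A^{[t_{i+1}, t_i)}\right) = \tfrac{1}{2}(B_i - L_i - H_i) + \tfrac{m}{2^{i+1}}\left(A^{<t_i} - A^{[t_{i+1}, t_i)}\right) = B_{i+1} + \tfrac{m}{2^{i+1}} A^{<t_{i+1}}. \]
□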
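The parity argument behind item (2) can be checked on small multisets. The sketch below models collections as `collections.Counter` objects and assumes that mod2 keeps one copy of every element with an odd count; the element names are placeholders, not palindromes from the text. It checks the identity used under the name Observation 2 and the fact that adding an even multiple of a collection leaves mod2 unchanged.

```python
from collections import Counter


def mod2(coll: Counter) -> Counter:
    """One copy of each element whose multiplicity is odd (assumed meaning of mod2)."""
    return Counter({x: 1 for x, c in coll.items() if c % 2 == 1})


def scale(coll: Counter, k: int) -> Counter:
    """The collection k * coll."""
    return Counter({x: c * k for x, c in coll.items()})


def obs2_rhs(x: Counter, y: Counter) -> Counter:
    """mod2(x) + mod2(y) - 2 * (mod2(x) & mod2(y))."""
    mx, my = mod2(x), mod2(y)
    return (mx + my) - scale(mx & my, 2)


B = Counter({"a": 3, "b": 2, "c": 1})
A = Counter({"a": 1, "d": 1})

even_shift = B + scale(A, 4)   # m / 2^i even: parity of every count is preserved
odd_shift = B + scale(A, 3)    # m / 2^i odd: parities of elements of A flip
```

With an even factor the result has the same mod2 as `B`, which is exactly why $L'_i = L_i$ for $i < d_m$.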

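Claim 15. Let $B' = B - m(2\beta + A) + m\alpha$ be a type I elementary folding of $B$. Then,

(2.1) For every $0 \le i \le d_m$, $B'_i = B_i - \frac{m}{2^i}(2\beta + A)^{<t_i} + \frac{m}{2^i}\alpha^{<t_i}$.

(2.2) $L'_i = L_i$ for every $0 \le i \le d_m - 1$.

(2.3) For every $0 \le i < d_m$, $H'_i = H_i - \frac{m}{2^i}(2\beta + A)^{[t_{i+1}, t_i)} + \frac{m}{2^i}\alpha^{[t_{i+1}, t_i)}$.

Proof. Let $B'' = B - m(2\beta + A)$. Therefore, $B = B'' + m(2\beta + A)$, and $B' = B'' + m\alpha$. From Claim 14,

(1) For every $0 \le i \le d_m$, $B_i = B''_i + \frac{m}{2^i}(2\beta + A)^{<t_i}$, and $B'_i = B''_i + \frac{m}{2^i}\alpha^{<t_i}$. Therefore, $B'_i = B_i - \frac{m}{2^i}(2\beta + A)^{<t_i} + \frac{m}{2^i}\alpha^{<t_i}$.

(2) $L'_i = L''_i = L_i$ for every $0 \le i \le d_m - 1$.

(3) For every $0 \le i < d_m$, $H_i = H''_i + \frac{m}{2^i}(2\beta + A)^{[t_{i+1}, t_i)}$, and $H'_i = H''_i + \frac{m}{2^i}\alpha^{[t_{i+1}, t_i)}$. Therefore, $H'_i = H_i - \frac{m}{2^i}(2\beta + A)^{[t_{i+1}, t_i)} + \frac{m}{2^i}\alpha^{[t_{i+1}, t_i)}$.
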
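□

Claim 16. Let $B$, $B'$, and $C$ be $l$-BFB palindrome collections, $A$ a convexed $l$-collection, and $m$ and $i$ two nonnegative integers, such that (a) $s'_j = s_j$ for every $0 \le j \le i$, (b) $t'_{i+1} \ge t_{i+1}$, (c) $|L'_i| + \frac{|H'_i|}{2} < |L_i| + \frac{|H_i|}{2} - |C|$, (d) $B_{i+1} \cap C = \emptyset$, and (e) $B'_{i+1} = B_{i+1} + C - mA$. Then, $B <_s B'$.

Proof. We prove the claim by induction on the size of $A$. Let $A = \{\alpha_1, \ldots, 2^{r-1}\alpha_r\}$, and denote $\bar{A} = \mathrm{mod2}(mA)$. Observe that $\bar{A} = \alpha_1$ when $A \ne \emptyset$ and $m$ is odd, and otherwise $\bar{A} = \emptyset$. In addition, observe that
\[ \mathrm{mod2}(B_{i+1} + C) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_{i+1}) + \mathrm{mod2}(C) - 2\left(\mathrm{mod2}(B_{i+1}) \cap \mathrm{mod2}(C)\right) = \mathrm{mod2}(B_{i+1}) + \mathrm{mod2}(C) = L_{i+1} + \mathrm{mod2}(C). \]
Therefore,
\[
\begin{aligned}
L'_{i+1} &= \mathrm{mod2}(B'_{i+1}) = \mathrm{mod2}(B_{i+1} + C - mA) \\
&\overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_{i+1} + C) + \mathrm{mod2}(mA) - 2\left(\mathrm{mod2}(B_{i+1} + C) \cap \mathrm{mod2}(mA)\right) \\
&= L_{i+1} + \mathrm{mod2}(C) + \bar{A} - 2\left((L_{i+1} + \mathrm{mod2}(C)) \cap \bar{A}\right).
\end{aligned}
\]
Since $(L_{i+1} + \mathrm{mod2}(C)) \cap \bar{A} \subseteq \bar{A}$, it follows that $|L'_{i+1}| \ge |L_{i+1}| + |\mathrm{mod2}(C)| + |\bar{A}| - 2|\bar{A}| = |L_{i+1}| + |\mathrm{mod2}(C)| - |\bar{A}|$. Therefore,
\[ s'_{i+1} = |L'_{i+1}| - |L'_i| - \tfrac{|H'_i|}{2} + \max(s'_i, 0) \overset{(a),(c)}{>} |L_{i+1}| + |\mathrm{mod2}(C)| - |\bar{A}| - |L_i| - \tfrac{|H_i|}{2} + |C| + \max(s_i, 0) = s_{i+1} + |\mathrm{mod2}(C)| + |C| - |\bar{A}|. \]
As both $s'_{i+1}$ and $s_{i+1}$ are integers, $s'_{i+1} \ge s_{i+1} + |\mathrm{mod2}(C)| + |C| - |\bar{A}| + 1 \ge s_{i+1}$ (recall that $|\bar{A}| \le 1$). When $s'_{i+1} > s_{i+1}$, $s < s'$ and the claim follows. Otherwise, $s'_{i+1} = s_{i+1}$, and there is a need to continue and examine positions greater than $i + 1$ in the signatures of $B$ and $B'$.

Note that for obtaining $s'_{i+1} = s_{i+1}$ we must have that $C = \emptyset$ and $\bar{A} = L_{i+1} \cap \bar{A} = \alpha_1$, which implies that $A \ne \emptyset$, $m$ is odd, and $L'_{i+1} = L_{i+1} + \alpha_1 - 2\alpha_1 = L_{i+1} - \alpha_1$ (thus $\alpha_1 \in L_{i+1}$, and in particular $\mathrm{top}(\alpha_1) \ge \min^t(L_{i+1}) \ge t_{i+2}$). Assuming by induction that the claim holds for every $B''$, $B'''$, $C'$, $A'$, $i'$, and $m'$ sustaining requirements (a) to (e) and $|A'| < |A|$, we show the claim also holds for $B$, $B'$, $C$, $A$, $i$, and $m$.

Now, since $s'_j = s_j$ for every $j \le i$ (requirement (a)) and $s'_{i+1} = s_{i+1}$, requirement (a) in the claim holds with respect to $i' = i + 1$. In addition,
\[ t'_{i+2} = \min\left(\min{}^t(L'_{i+1}), t'_{i+1}\right) \ge \min\left(\min{}^t(L_{i+1} - \alpha_1), t_{i+1}\right) \ge \min\left(\min{}^t(L_{i+1}), t_{i+1}\right) = t_{i+2}, \]
thus requirement (b) also holds with respect to $i' = i + 1$. Furthermore,
\[
\begin{aligned}
H'_{i+1} &\overset{\text{Obs. 4}}{=} (B'_{i+1} - L'_{i+1})^{\ge t'_{i+2}} = (B_{i+1} - mA - L_{i+1} + \alpha_1)^{\ge t'_{i+2}} \\
&= (B_{i+1} - L_{i+1})^{\ge t'_{i+2}} - mA^{\ge t'_{i+2}} + \alpha_1^{\ge t'_{i+2}} \\
&= H_{i+1} - (B_{i+1} - L_{i+1})^{[t_{i+2}, t'_{i+2})} - mA^{\ge t'_{i+2}} + \alpha_1^{\ge t'_{i+2}}.
\end{aligned}
\]

Since $\alpha_1 \in A$, the operation $A - \alpha_1$ yields a valid collection. In addition, note that the count of each element in $B_{i+1} - L_{i+1}$ is even, and since $m$ is odd, $B_{i+1}$ contains at least $m$ copies of $\alpha_1$ (as $mA \subseteq B_{i+1}$) and $L_{i+1}$ contains exactly one copy of $\alpha_1$, $B_{i+1} - L_{i+1} - (m-1)\alpha_1$ is a valid collection, in which the count of each element is even. Denote
\[ \tilde{C} = \tfrac{1}{2}\left(B_{i+1} - L_{i+1} - (m-1)\alpha_1\right)^{[t_{i+2}, t'_{i+2})} = \tfrac{1}{2}\left((B_{i+1} - L_{i+1})^{[t_{i+2}, t'_{i+2})} - (m-1)\alpha_1^{<t'_{i+2}}\right). \]
Now we can write
\[
\begin{aligned}
H'_{i+1} &= H_{i+1} - (B_{i+1} - L_{i+1})^{[t_{i+2}, t'_{i+2})} - mA^{\ge t'_{i+2}} + \alpha_1^{\ge t'_{i+2}} \\
&= H_{i+1} - 2\tilde{C} - (m-1)\alpha_1^{<t'_{i+2}} - mA^{\ge t'_{i+2}} + \alpha_1^{\ge t'_{i+2}} \\
&= H_{i+1} - 2\tilde{C} - (m-1)\left(\alpha_1 - \alpha_1^{\ge t'_{i+2}}\right) - mA^{\ge t'_{i+2}} + \alpha_1^{\ge t'_{i+2}} \\
&= H_{i+1} - 2\tilde{C} - (m-1)\alpha_1 - m(A - \alpha_1)^{\ge t'_{i+2}}.
\end{aligned}
\]
Hence $|H'_{i+1}| \le |H_{i+1}| - 2|\tilde{C}|$, and since $|L'_{i+1}| = |L_{i+1}| - 1$ we get that
\[ |L'_{i+1}| + \tfrac{|H'_{i+1}|}{2} \le |L_{i+1}| - 1 + \tfrac{|H_{i+1}|}{2} - |\tilde{C}| < |L_{i+1}| + \tfrac{|H_{i+1}|}{2} - |\tilde{C}|, \]
and in particular requirement (c) holds with respect to $C' = \tilde{C}$ and $i' = i + 1$. Moreover, by definition the top values of all elements in $\tilde{C}$ are at least $t_{i+2}$, and from Observation 4, $B_{i+2} = \frac{1}{2^{i+2}} B^{<t_{i+2}}$, hence the top values of all elements in $B_{i+2}$ are lower than $t_{i+2}$. Thus, $B_{i+2} \cap \tilde{C} = \emptyset$, and requirement (d) holds with respect to $C' = \tilde{C}$ and $i' = i + 1$.

From Claim 7, there is an integer $x > 0$ and a convexed $l$-collection $\tilde{A}$ such that $|\tilde{A}| < |A|$ and $(A - \alpha_1)^{<t'_{i+2}} = 2^x \tilde{A}$. Therefore,
\[
\begin{aligned}
H'_{i+1} &= H_{i+1} - 2\tilde{C} - (m-1)\alpha_1 - m(A - \alpha_1)^{\ge t'_{i+2}} \\
&= H_{i+1} - 2\tilde{C} - (m-1)\alpha_1 - m\left((A - \alpha_1) - (A - \alpha_1)^{<t'_{i+2}}\right) \\
&= H_{i+1} - 2\tilde{C} - (mA - \alpha_1) + m2^x\tilde{A},
\end{aligned}
\]
and
\[
\begin{aligned}
B'_{i+2} &= \tfrac{1}{2}\left(B'_{i+1} - L'_{i+1} - H'_{i+1}\right) \\
&= \tfrac{1}{2}\left((B_{i+1} - mA) - (L_{i+1} - \alpha_1) - (H_{i+1} - 2\tilde{C} - mA + \alpha_1 + m2^x\tilde{A})\right) \\
&= \tfrac{1}{2}\left(B_{i+1} - L_{i+1} - H_{i+1}\right) + \tilde{C} - m2^{x-1}\tilde{A} \\
&= B_{i+2} + \tilde{C} - m2^{x-1}\tilde{A}.
\end{aligned}
\]

Since $x > 0$, $m2^{x-1}$ is an integer. Therefore, requirement (e) holds with respect to $C' = \tilde{C}$, $A' = \tilde{A}$, $i' = i + 1$, and $m' = m2^{x-1}$. From the inductive assumption and the fact that $|\tilde{A}| < |A|$, the claim follows. □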

Claim 17. Let $B'$ be a type I elementary folding of $B$. Then, $B <_s B'$.

Proof. By definition, $B'$ is of the form $B' = B - m(2\beta + A) + m\alpha$, where $m > 0$ is an integer, $\alpha$ and $\beta$ are $l$-BFB palindromes, and $A = \{\alpha_1, \ldots, 2^{r-1}\alpha_r\}$ is a convexed $l$-collection such that $\alpha = \beta\gamma_A\beta$. Let $q \ge 0$ be the index such that $\beta \in L_q + H_q$. From Observation 4, $t_{q+1} \le \mathrm{top}(\beta) < t_q$, and therefore for every $0 \le i \le q$ and every $\alpha_j \in A$ we have that (*) $\mathrm{top}(\alpha_j) < \mathrm{top}(\alpha) = \mathrm{top}(\beta) < t_q \le t_i$. Let $d = d_m$. We consider two cases: (1) $q < d$, and (2) $q \ge d$, and show for each case that $B <_s B'$.

(1) $q < d$. In this case, condition (*) implies that for every $0 \le i < q$, we have that $(2\beta + A)^{[t_{i+1}, t_i)} = \alpha^{[t_{i+1}, t_i)} = \emptyset$. In addition, $(2\beta + A)^{[t_{q+1}, t_q)} = 2\beta + A^{\ge t_{q+1}}$, $\alpha^{[t_{q+1}, t_q)} = \alpha$, and $(2\beta + A)^{<t_{q+1}} = A^{<t_{q+1}}$, $\alpha^{<t_{q+1}} = \emptyset$. Thus, from Claim 15, we get that
\[ B'_{q+1} = B_{q+1} - \tfrac{m}{2^{q+1}} A^{<t_{q+1}} \overset{\text{Clm. 7}}{=} B_{q+1} - m'\tilde{A}, \]
where $m'$ is a positive integer and $\tilde{A}$ is a convexed $l$-collection,
\[ L'_i = L_i \text{ for every } i \le q, \qquad H'_i = H_i \text{ for every } i \le q - 1, \qquad H'_q = H_q - \tfrac{m}{2^q}\left(2\beta + A^{\ge t_{q+1}}\right) + \tfrac{m}{2^q}\alpha. \]
Observe that these equalities of the $L$ and $H$ series imply that $s'_i = s_i$ for every $i \le q$ and $t'_{q+1} = t_{q+1}$. Also, observe that $|H'_q| \le |H_q| - \frac{m}{2^q} < |H_q|$, therefore $|L'_q| + \frac{|H'_q|}{2} < |L_q| + \frac{|H_q|}{2}$. Applying Claim 16 with respect to entities $B$, $B'$, $C = \emptyset$, $\tilde{A}$, $i = q$, and $m'$, we get that $B <_s B'$.

(2) $q \ge d$. In this case, condition (*) implies that for every $0 \le i < d$, we have that $(2\beta + A)^{[t_{i+1}, t_i)} = \alpha^{[t_{i+1}, t_i)} = \emptyset$, and $(2\beta + A)^{<t_d} = 2\beta + A$, $\alpha^{<t_d} = \alpha$. Therefore, from Claim 15,
\[ B'_d = B_d - \tfrac{m}{2^d}(2\beta + A) + \tfrac{m}{2^d}\alpha, \qquad L'_i = L_i \text{ and } H'_i = H_i \text{ for every } i \le d - 1. \]
Again, these equalities imply that $t'_d = t_d$ and $s'_i = s_i$ for every $i \le d - 1$. Denote $m' = \frac{m}{2^d}$, and observe that $m'$ is an odd nonnegative integer. Thus, $\mathrm{mod2}(m'(2\beta + A)) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(A) = \alpha_1$, and $\mathrm{mod2}(m'\alpha) \overset{\text{Obs. 2}}{=} \alpha$. Since $\alpha \notin B_d - m'(2\beta + A) \subseteq B$ (by definition of elementary folding), we get that $\mathrm{mod2}(B_d - m'(2\beta + A)) \cap \mathrm{mod2}(m'\alpha) = \emptyset$. Consequently,
\[
\begin{aligned}
L'_d &= \mathrm{mod2}(B'_d) = \mathrm{mod2}\left(B_d - m'(2\beta + A) + m'\alpha\right) \\
&\overset{\text{Obs. 2}}{=} \mathrm{mod2}\left(B_d - m'(2\beta + A)\right) + \alpha \\
&= L_d + \alpha_1 - 2(L_d \cap \alpha_1) + \alpha.
\end{aligned}
\]
Note that $|L'_d| = |L_d| + 2 - 2|L_d \cap \alpha_1|$, and therefore
\[ s'_d = |L'_d| - |L'_{d-1}| - \tfrac{|H'_{d-1}|}{2} + \max(s'_{d-1}, 0) = |L_d| + 2 - 2|L_d \cap \alpha_1| - |L_{d-1}| - \tfrac{|H_{d-1}|}{2} + \max(s_{d-1}, 0) = s_d + 2 - 2|L_d \cap \alpha_1|. \]
When $L_d \cap \alpha_1 = \emptyset$, $s'_d = s_d + 2$, and so $s < s'$ and the claim follows. Else, $L_d \cap \alpha_1 = \alpha_1$, $L'_d = L_d - \alpha_1 + \alpha$, therefore $s'_d = s_d$, and there is a need to continue and examine positions greater than $d$ in the signatures of $B'$ and $B$.

In this remaining case $s'_i = s_i$ for every $i \le d$, thus requirement (a) in Claim 16 holds with respect to $B$, $B'$ and $i = d$. In addition, $\alpha_1 \in L_d$, implying that $t_{d+1} \le \min^t(L_d) \le \mathrm{top}(\alpha_1) < \mathrm{top}(\beta)$, and so $q = d$. Moreover, $t'_{d+1} \le \min^t(L'_d) \le \mathrm{top}(\alpha) = \mathrm{top}(\beta)$, and $t'_{d+1} = \min(\min^t(L'_d), t'_d) = \min(\min^t(L_d - \alpha_1 + \alpha), t_d) \ge \min(\min^t(L_d), t_d) = t_{d+1}$, hence requirement (b) in Claim 16 holds with respect to $B$, $B'$ and $i = d$. Now,
\[
\begin{aligned}
H'_d &= (B'_d - L'_d)^{\ge t'_{d+1}} \\
&= \left((B_d - m'(2\beta + A) + m'\alpha) - (L_d - \alpha_1 + \alpha)\right)^{\ge t'_{d+1}} \\
&= (B_d - L_d)^{\ge t'_{d+1}} - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}} \\
&= \left((B_d - L_d)^{\ge t_{d+1}} - (B_d - L_d)^{[t_{d+1}, t'_{d+1})}\right) - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}} \\
&= H_d - (B_d - L_d)^{[t_{d+1}, t'_{d+1})} - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}}.
\end{aligned}
\]
Observe that each element in $B_d - L_d$ has an even count, $B_d$ contains at least $m'$ copies of $\alpha_1$ (since $m'A \subseteq B_d$) and $L_d$ contains exactly one copy of $\alpha_1$, therefore $B_d - L_d - (m' - 1)\alpha_1$ is a valid collection in which each element has an even count (recall that $m'$ is odd). Denote
\[ C = \tfrac{1}{2}\left(B_d - L_d - (m' - 1)\alpha_1\right)^{[t_{d+1}, t'_{d+1})} = \tfrac{1}{2}\left((B_d - L_d)^{[t_{d+1}, t'_{d+1})} - (m' - 1)\alpha_1^{<t'_{d+1}}\right). \]
Next, we can write
\[
\begin{aligned}
H'_d &= H_d - (B_d - L_d)^{[t_{d+1}, t'_{d+1})} - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}} \\
&= H_d - 2C - (m' - 1)\alpha_1^{<t'_{d+1}} - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}} \\
&= H_d - 2C - (m' - 1)\left(\alpha_1 - \alpha_1^{\ge t'_{d+1}}\right) - 2m'\beta + (m' - 1)\alpha - m'A^{\ge t'_{d+1}} + \alpha_1^{\ge t'_{d+1}} \\
&= H_d - 2C - (m' - 1)\alpha_1 - 2m'\beta + (m' - 1)\alpha - m'(A - \alpha_1)^{\ge t'_{d+1}}.
\end{aligned}
\]
Since $m' \ge 1$ (being an odd nonnegative integer), we get that $|H'_d| = |H_d| - 2|C| - 2m' - m'|(A - \alpha_1)^{\ge t'_{d+1}}| < |H_d| - 2|C|$. Therefore, $|L'_d| + \frac{|H'_d|}{2} < |L_d| + \frac{|H_d|}{2} - |C|$, and condition (c) of Claim 16 holds with respect to $B$, $B'$, $C$, and $i = d$. In addition, $B_{d+1} \overset{\text{Obs. 4}}{=} \frac{1}{2^{d+1}} B^{<t_{d+1}}$, and in particular the top values of all elements in $B_{d+1}$ are lower than $t_{d+1}$. Since the top values of all elements in $C$ are at least $t_{d+1}$, we have that $B_{d+1} \cap C = \emptyset$, and condition (d) of Claim 16 holds with respect to $B$, $B'$, $C$, and $i = d$. From Claim 7, there is an integer $x > 0$ and a convexed $l$-collection $\tilde{A}$ such that $|\tilde{A}| < |A|$ and $(A - \alpha_1)^{<t'_{d+1}} = 2^x\tilde{A}$, and so
\[
\begin{aligned}
H'_d &= H_d - 2C - (m' - 1)\alpha_1 - 2m'\beta + (m' - 1)\alpha - m'(A - \alpha_1)^{\ge t'_{d+1}} \\
&= H_d - 2C - (m' - 1)\alpha_1 - 2m'\beta + (m' - 1)\alpha - m'\left((A - \alpha_1) - (A - \alpha_1)^{<t'_{d+1}}\right) \\
&= H_d - 2C - 2m'\beta + (m' - 1)\alpha - (m'A - \alpha_1) + m'2^x\tilde{A},
\end{aligned}
\]
\[
\begin{aligned}
B'_{d+1} &= \tfrac{1}{2}\left(B'_d - L'_d - H'_d\right) \\
&= \tfrac{1}{2}\left((B_d - m'(2\beta + A) + m'\alpha) - (L_d - \alpha_1 + \alpha) - (H_d - 2C - 2m'\beta + (m' - 1)\alpha - (m'A - \alpha_1) + m'2^x\tilde{A})\right) \\
&= \tfrac{1}{2}\left(B_d - L_d - H_d\right) + C - m'2^{x-1}\tilde{A} \\
&= B_{d+1} + C - m'2^{x-1}\tilde{A}.
\end{aligned}
\]
Since $x > 0$, $m'2^{x-1}$ is an integer. Therefore, requirement (e) in Claim 16 holds with respect to $B$, $B'$, $C$, $\tilde{A}$, $i = d$, and $m'' = m'2^{x-1}$, and the claim follows.

□

Claim 18. Let $B' = B + m\varepsilon$ be a type II elementary folding. For $d = d_m$, we have

(1) $s'_i = s_i$ for every $0 \le i \le d - 1$,

(2) $s'_d = s_d + 1$,

(3) $r' = d + 1$.

Proof. Since the folding is elementary, $\varepsilon \notin B$, and for every $i \ge 0$ we have $t_i > 0$. Therefore, $\varepsilon^{<t_i} = \varepsilon$ and $\varepsilon^{[t_{i+1}, t_i)} = \emptyset$. From Claim 14, we get that $L'_i = L_i$ and $H'_i = H_i$ for every $0 \le i \le d - 1$, and it follows that $s'_i = s_i$ for every $0 \le i \le d - 1$. Since $\frac{m}{2^d}$ is an odd integer (by definition) and $\varepsilon \notin L_d \subseteq B$,
\[ L'_d = \mathrm{mod2}(B'_d) = \mathrm{mod2}\left(B_d + \tfrac{m}{2^d}\varepsilon\right) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_d + \varepsilon) \overset{\text{Obs. 2}}{=} L_d + \varepsilon - 2(L_d \cap \varepsilon) = L_d + \varepsilon. \]
If $d = 0$ then $s'_d = |L'_d| = |L_d| + 1 = s_d + 1$, and otherwise
\[ s'_d = |L'_d| - |L'_{d-1}| - \tfrac{|H'_{d-1}|}{2} + \max(s'_{d-1}, 0) = |L_d| + 1 - |L_{d-1}| - \tfrac{|H_{d-1}|}{2} + \max(s_{d-1}, 0) = s_d + 1. \]
Finally, as $\varepsilon \in L'_d$ (and in particular $r' > d$), it follows that $t'_{d+1} = \min^t(L'_d) = \mathrm{top}(\varepsilon) = 0$, $B'_{d+1} \overset{\text{Obs. 4}}{=} \frac{1}{2^{d+1}} B'^{<0} = \emptyset$, and so $r' = d + 1$. □

Finally, we prove the main claim in this section.

Proof of Claim 12. The correctness of the claim follows immediately from Claims 13, 17, and 18. □

2.3. The FOLD Procedure. Using the notation and definitions given in the previous sections, we now give an explicit description of the FOLD procedure.

Procedure: FOLD($B$, $n$)

Input: An $l$-BFB palindrome collection $B$ and an integer $n > 0$.

Output: A minimum signature folding $B'$ of $B$ such that $|B'| = n$, or the string "FAILED" if there is no such $B'$.

1  If $|B| \le n$ then return $B + (n - |B|)\varepsilon$.
2  Else
3      Let $s = s(B)$ and $\Delta = \Delta(B)$.
4      If there exists $0 \le d \le d_{|B|-n}$ such that $n \ge 2^d \max(s_d + 1, 0) + \Delta_d$ then
5          Let $d$ be the maximum integer sustaining the condition above. Set $B' \leftarrow B + 2^d\varepsilon$.
6          While $|B'| > n$ do
7              Set $B' \leftarrow$ RIGHT-FOLD($B'$).
8          Set $B' \leftarrow B' + (n - |B'|)\varepsilon$.
9          Return $B'$
10     Else return "FAILED"



Procedure: RIGHT-FOLD($B$)

Input: An $l$-BFB palindrome collection $B$.

Precondition: Let $(\vec{B}, \vec{L}, \vec{H})$ be the decomposition of $B$, and $r = r(B)$. There is an integer $0 \le g < r$ such that $H_g \ne \emptyset$, $L_g \ne \emptyset$, and for every $g < i < r$, $H_i = \emptyset$ and $L_i \ne \emptyset$.

Output: A folding $B'$ of $B$ of size $|B| - 2^r$.

1  Let $g$ be an integer as implied from the precondition (note that $g$ is unique), $\beta$ a minimal element in $H_g$, $A = \{\alpha_1, 2\alpha_2, \ldots, 2^{r-g-1}\alpha_{r-g}\}$ a convexed $l$-collection such that $\alpha_i \in L_{g+i-1}$ for each $1 \le i \le r - g$ and $\alpha_1$ is a minimal element in $L_g$, and $\alpha = \beta\gamma_A\beta$.
2  Return the collection $B' = B - 2^g(2\beta + A) + 2^g\alpha$.

In general, it is easy to assert that when the precondition holds, the returned collection $B'$ from the RIGHT-FOLD procedure is a folding of the input collection $B$, where each one of the $2^g$ copies of $\alpha$ in $B'$ is obtained by concatenating all elements in $A$ and two copies of $\beta$. Since a right-folding adds to the collection $2^g$ copies of $\alpha$ while removing $2^g$ repeats of the collection $2\beta + A$ of size $2 + 2^{r-g} - 1 = 1 + 2^{r-g}$, the size of the folded collection $B'$ is smaller by $2^r$ than the size of the original collection $B$.
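The size bookkeeping of a right-folding can be verified numerically. The helper below is a hypothetical illustration, not part of the procedure: it counts only multiplicities, taking $|A| = 1 + 2 + \cdots + 2^{r-g-1}$ as in line 1 of RIGHT-FOLD.

```python
def right_fold_size_change(g: int, r: int) -> int:
    """Net change in collection size caused by one right-folding (0 <= g < r)."""
    size_a = sum(2 ** k for k in range(r - g))   # |A| = 2^(r-g) - 1
    removed = 2 ** g * (2 + size_a)              # 2^g copies of 2*beta + A
    added = 2 ** g                               # 2^g copies of alpha
    return added - removed
```

For every admissible pair $(g, r)$ the change equals $-2^r$, independently of $g$.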

2.3.1. Right-folding Properties. In this section we show certain characteristics of right-foldings.

Claim 19. There is a right-folding of a collection $B$ if and only if $s_r < 0$.

Proof. For the first direction of the proof, assume that there is a right-folding $B'$ of $B$. From Claim 8, $|L_g| \ge \max(s_g, 0)$. Since $H_g \ne \emptyset$, we get that $-|L_g| - \frac{|H_g|}{2} + \max(s_g, 0) < 0$. This, in turn, implies that $s_{g+1} = |L_{g+1}| - |L_g| - \frac{|H_g|}{2} + \max(s_g, 0) < |L_{g+1}|$. If $r = g + 1$, then $s_r = s_{g+1} < |L_{g+1}| = 0$, and the claim follows. Otherwise, $L_{g+1} \ne \emptyset$, and in particular $-|L_{g+1}| - \frac{|H_{g+1}|}{2} + \max(s_{g+1}, 0) < 0$. Inductively, this shows that $s_r < 0$.

For the second direction, assume that $s_r < 0$. Assume that for some $i < r$ we have that $-|L_i| - \frac{|H_i|}{2} + \max(s_i, 0) < 0$, and that $H_d = \emptyset$ and $L_d \ne \emptyset$ for all $i < d < r$. Note that this requirement holds for $i = r - 1$, since $-|L_{r-1}| - \frac{|H_{r-1}|}{2} + \max(s_{r-1}, 0) = |L_r| - |L_{r-1}| - \frac{|H_{r-1}|}{2} + \max(s_{r-1}, 0) = s_r < 0$, and there are no integers $d$ such that $r - 1 < d < r$. If $H_i \ne \emptyset$, then also $L_i \ne \emptyset$ (since $L_i = \emptyset$ implies that $\min^t(L_i) = \infty$, and by definition $H_i = (B_i - L_i)^{\ge \infty} = \emptyset$), and therefore the requirements for the existence of a right-folding hold for $g = i$. Else, $H_i = \emptyset$, and so $-|L_i| + \max(s_i, 0) < 0$. This implies that $L_i \ne \emptyset$ and that $i \ne 0$ (as $|L_0| = s_0$), and so $|L_i| > \max(s_i, 0) \ge s_i = |L_i| - |L_{i-1}| - \frac{|H_{i-1}|}{2} + \max(s_{i-1}, 0)$, and we get that $-|L_{i-1}| - \frac{|H_{i-1}|}{2} + \max(s_{i-1}, 0) < 0$. Inductively, for some $i' < r$, it must hold that $H_{i'} \ne \emptyset$, $H_d = \emptyset$ for every $i' < d < r$, and $L_d \ne \emptyset$ for every $i' \le d < r$, meeting the requirements for the existence of a right-folding for $g = i'$. □

Throughout the remainder of this section, assume that $B$ is a collection satisfying the precondition in Procedure RIGHT-FOLD, and let $B' = B - 2^g(2\beta + A) + 2^g\alpha$ be the output of the procedure (where $g$, $r$, $\beta$, $A = \{\alpha_1, \ldots, 2^{r-g-1}\alpha_{r-g}\}$, and $\alpha$ are as defined in the procedure). Note that when $\alpha \notin B$, $B'$ is also a type I elementary folding of $B$. We later show that all right-foldings performed in line 7 of the FOLD procedure are elementary.

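Claim 20. If $B'$ is an elementary folding of $B$, then

(1) $L'_i = L_i$ for every $i \le g - 1$, and $L'_g = L_g - \alpha_1 + \alpha$.

(2) $H'_i = H_i$ for every $i \le g - 1$, and $H'_g = H_g - 2\beta$.

(3) For every $g < i < r$, $L'_i = L_i - \alpha_{i-g+1}$, and $H'_i = H_i = \emptyset$.

(4) $B'_r = B_r = \emptyset$ (thus $r' \le r$).

(5) $|B'| = |B| - 2^r$.

Proof. We start by showing the first two items in the claim. Since $\beta \in H_g \overset{\text{Obs. 4}}{\subseteq} B^{<t_g}$, it follows that for every $\alpha_j \in A$ and every $0 \le i \le g$, $\mathrm{top}(\alpha_j) < \mathrm{top}(\alpha) = \mathrm{top}(\beta) < t_g \le t_i$. Therefore, from Claim 15 and the fact that $d_{2^g} = g$, we get that $L'_i = L_i$ for every $i \le g - 1$ (and in particular $t'_g = t_g$), $H'_i = H_i$ for every $i \le g - 1$, and $B'_g = B_g - (2\beta + A) + \alpha$. Since $\mathrm{mod2}(2\beta + A) = \alpha_1 \in L_g$, we have
\[ \mathrm{mod2}(B_g - (2\beta + A)) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_g + \alpha_1) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_g) + \alpha_1 - 2(\mathrm{mod2}(B_g) \cap \alpha_1) = L_g + \alpha_1 - 2(L_g \cap \alpha_1) = L_g - \alpha_1. \]
As $\alpha \notin L_g + \alpha_1$ (follows from $\alpha \notin B$), we get that $L'_g = \mathrm{mod2}(B'_g) = \mathrm{mod2}(B_g - (2\beta + A) + \alpha) \overset{\text{Obs. 2}}{=} \mathrm{mod2}((L_g - \alpha_1) + \alpha) = L_g - \alpha_1 + \alpha$. By definition, $\alpha_1$ is a minimal element in $L_g$, therefore $\mathrm{top}(\alpha_1) = \min^t(L_g)$, and so $\mathrm{top}(\alpha_1) \le \min^t(L_g - \alpha_1)$. In addition, $\mathrm{top}(\alpha_1) < \mathrm{top}(\alpha)$, and therefore $\mathrm{top}(\alpha_1) \le \min(\min^t(L_g - \alpha_1), \mathrm{top}(\alpha)) = \min^t(L_g - \alpha_1 + \alpha) = \min^t(L'_g)$. On the other hand, $\mathrm{top}(\beta) = \mathrm{top}(\alpha)$, and therefore $\min^t(L'_g) = \min^t(L_g - \alpha_1 + \alpha) \le \mathrm{top}(\beta)$. From the minimality of $\beta$ in $H_g$,
\[ H_g = H_g^{\ge \mathrm{top}(\beta)} = \left((B_g - L_g)^{\ge \min^t(L_g)}\right)^{\ge \mathrm{top}(\beta)} = \left((B_g - L_g)^{\ge \mathrm{top}(\alpha_1)}\right)^{\ge \mathrm{top}(\beta)} = (B_g - L_g)^{\ge \mathrm{top}(\beta)}. \]
Since $\mathrm{top}(\alpha_1) \le \min^t(L'_g) \le \mathrm{top}(\beta)$, we get that $H_g = (B_g - L_g)^{\ge \mathrm{top}(\beta)} \subseteq (B_g - L_g)^{\ge \min^t(L'_g)} \subseteq (B_g - L_g)^{\ge \mathrm{top}(\alpha_1)} = H_g$, and so $(B_g - L_g)^{\ge \min^t(L'_g)} = H_g$. In addition, $\beta^{\ge \min^t(L'_g)} = \beta$, and $(A - \alpha_1)^{\ge \min^t(L'_g)} = \emptyset$ (since for all $i > 1$, $\mathrm{top}(\alpha_i) < \mathrm{top}(\alpha_1) \le \min^t(L'_g)$). Thus, we get that
\[ H'_g = (B'_g - L'_g)^{\ge \min^t(L'_g)} = \left((B_g - 2\beta - A + \alpha) - (L_g - \alpha_1 + \alpha)\right)^{\ge \min^t(L'_g)} = (B_g - L_g)^{\ge \min^t(L'_g)} - 2\beta^{\ge \min^t(L'_g)} - (A - \alpha_1)^{\ge \min^t(L'_g)} = H_g - 2\beta, \]
as required.

Next, we turn to show item 3 in the claim, which is relevant only for the case where $g < r - 1$. We prove this item inductively for all $g < i < r$. Note that $B'_{g+1} = \frac{1}{2}(B'_g - L'_g - H'_g) = \frac{1}{2}((B_g - 2\beta - A + \alpha) - (L_g - \alpha_1 + \alpha) - (H_g - 2\beta)) = \frac{1}{2}((B_g - L_g - H_g) - (A - \alpha_1)) = B_{g+1} - \frac{1}{2}A$. Now, assume that for some $g < i < r$, $B'_i = B_i - \frac{1}{2^{i-g}}A$. Note that $\alpha_{i-g+1} \in L_i$, and in particular $L_i \cap \alpha_{i-g+1} = \alpha_{i-g+1}$. Therefore,
\[ L'_i = \mathrm{mod2}(B'_i) = \mathrm{mod2}\left(B_i - \tfrac{1}{2^{i-g}}A\right) \overset{\text{Obs. 2}}{=} \mathrm{mod2}(B_i) + \mathrm{mod2}\left(\tfrac{1}{2^{i-g}}A\right) - 2\left(\mathrm{mod2}(B_i) \cap \mathrm{mod2}\left(\tfrac{1}{2^{i-g}}A\right)\right) = L_i + \alpha_{i-g+1} - 2(L_i \cap \alpha_{i-g+1}) = L_i - \alpha_{i-g+1}, \]
as required. In addition, since $\min^t(L'_i) = \min^t(L_i - \alpha_{i-g+1}) \ge \min^t(L_i) = \mathrm{top}(\alpha_{i-g+1})$, we get that
\[ H'_i = (B'_i - L'_i)^{\ge \min^t(L'_i)} = \left(\left(B_i - \tfrac{1}{2^{i-g}}A\right) - (L_i - \alpha_{i-g+1})\right)^{\ge \min^t(L'_i)} \subseteq \left(B_i - L_i - \left(\tfrac{1}{2^{i-g}}A - \alpha_{i-g+1}\right)\right)^{\ge \min^t(L_i)} \subseteq H_i = \emptyset, \]
and so $H'_i = \emptyset$, as required. Finally, it follows that $B'_{i+1} = \frac{1}{2}(B'_i - L'_i - H'_i) = \frac{1}{2}((B_i - \frac{1}{2^{i-g}}A) - (L_i - \alpha_{i-g+1}) - H_i) = B_{i+1} - \frac{1}{2}(\frac{1}{2^{i-g}}A - \alpha_{i-g+1}) = B_{i+1} - \frac{1}{2^{i+1-g}}A$, as required for the next inductive step.

Item 4 in the claim follows from the fact that $B'_r = B_r - \frac{1}{2^{r-g}}A = \emptyset - \emptyset = \emptyset$, and item 5 is obtained from the fact that $|B'| = |B| - 2^g(2 + |A|) + 2^g = |B| - 2^g(2 + 2^{r-g} - 1) + 2^g = |B| - 2^r$. □

Claim 21. If $B'$ is an elementary folding of $B$, then $s'_i = s_i$ for every $i \le r - 1$ and $s'_r = s_r + 1$. In addition, $r' \le r$, where if $s'_r < 0$ then $r' = r$.

Proof. By Definition 10, the values in the series $s'$ depend only on sizes of collections in $L'$ and $H'$. These sub-collection sizes may be inferred from Claim 20, and their assignments in Definition 10 imply the correctness of the claim in a straightforward manner. □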
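The computation that this proof leaves implicit can be sketched as follows, using the recurrence $s_0 = |L_0|$, $s_{i+1} = |L_{i+1}| - |L_i| - |H_i|/2 + \max(s_i, 0)$ as it appears in the derivations of this section. The size profile below is a made-up example (it is assumed, not checked, to arise from an actual collection), and `right_fold_sizes` is a hypothetical helper that applies the size changes listed in Claim 20.

```python
def signature(l_sizes, h_sizes):
    """Signature values s_0..s_r computed from the sizes |L_i| and |H_i| only."""
    s = [l_sizes[0]]
    for i in range(1, len(l_sizes)):
        s.append(l_sizes[i] - l_sizes[i - 1] - h_sizes[i - 1] // 2 + max(s[-1], 0))
    return s


def right_fold_sizes(l_sizes, h_sizes, g):
    """Sizes after an elementary right-folding, following Claim 20 (items 1 to 3)."""
    r = len(l_sizes) - 1
    new_l, new_h = list(l_sizes), list(h_sizes)
    new_h[g] -= 2                 # H'_g = H_g - 2*beta; |L'_g| is unchanged
    for i in range(g + 1, r):
        new_l[i] -= 1             # L'_i loses one element for g < i < r
    return new_l, new_h


# Hypothetical sizes with r = 3 and g = 1 (H_1 and L_1 non-empty, H_2 empty, L_2 non-empty).
L_SIZES, H_SIZES = [1, 2, 1, 0], [0, 2, 0, 0]
before = signature(L_SIZES, H_SIZES)
after = signature(*right_fold_sizes(L_SIZES, H_SIZES, g=1))
```

In this example the signature is unchanged at positions $0$ to $r - 1$ and grows by exactly one at position $r$, in line with Claim 21.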

Let $\beta$ be a palindrome obtained by concatenating zero or more $l$-blocks. If $\beta$ is obtained by concatenating an odd number of blocks, $\beta$ is of the form $\beta = \beta_1\beta_2 \cdots \beta_{q-1}\beta_q\beta_{q-1} \cdots \beta_2\beta_1$ (where each $\beta_i$ is an $l$-block), whereas if $\beta$ is obtained by concatenating an even number of blocks it is of the form $\beta = \beta_1\beta_2 \cdots \beta_{q-1}\beta_q\varepsilon\beta_q\beta_{q-1} \cdots \beta_2\beta_1$. Call $\beta_q$ or $\varepsilon$ respectively the center of $\beta$, in these two cases. Note that the center of an $l$-block $\beta$ is $\beta$.

Definition 12. Say that an $l$-BFB palindrome collection $B$ has unique centers if all elements in collections of the form $H_d$ are $l$-blocks, and for every $\beta \in L_d$ and $\beta' \in L_{d'}$ (for some possibly equal integers $d$ and $d'$) such that $\beta \ne \beta'$, the centers of $\beta$ and $\beta'$ differ.

Claim 22. If $B$ has unique centers then $B'$ is an elementary folding of $B$, and $B'$ has unique centers.

Proof. To prove the folding is elementary, we need to show that $\alpha \notin B$. Note that $\beta \in H_g$, $\alpha_1 \in L_g$, $\mathrm{top}(\alpha) = \mathrm{top}(\beta)$, and the center of $\alpha$ is the center of $\alpha_1$. Assume by contradiction that $\alpha \in B$. Since $\mathrm{top}(\alpha) = \mathrm{top}(\beta)$, Observation 4 implies that $\alpha \in L_g + H_g$. Since $\alpha$ is not an $l$-block, $\alpha \notin H_g$. Since $\alpha_1$ and $\alpha$ have the same center, and since $\alpha_1 \in L_g$ and $B$ has unique centers, it follows that $\alpha \notin L_g$, leading to a contradiction.

The fact that $B'$ has unique centers follows from the contents of collections in the series $L'$ and $H'$, as given in Claim 20. □

Claim 23. Let $B$ be an $l$-BFB palindrome collection with unique centers. Then, for $i = -s_r - \min(s_{r-1}, 0)$, it is possible to produce a series of collections $B^0 = B, B^1, B^2, \ldots, B^i$, each collection $B^j$ obtained by applying an elementary right-folding over the preceding collection $B^{j-1}$. In addition, for every $0 \le j \le -s_r$,

• $|B^j| = |B| - 2^r j$,

• $s^j_r = s_r + j$,

and for each $-s_r < j \le -s_r - \min(s_{r-1}, 0)$,

• $r^j \le r - 1$,

• $|B^j| = |B| + 2^{r-1}(s_r - j)$,

• $s^j_{r-1} = s_{r-1} + j + s_r$.

Proof. From Claim 9, $s_r \le 0$. Note that $s_r = -|L_{r-1}| - \frac{|H_{r-1}|}{2} + \max(s_{r-1}, 0) \le \max(s_{r-1}, 0)$. When $s_r = 0$, $\max(s_{r-1}, 0) > 0$, and so $\min(s_{r-1}, 0) = 0$ and in particular $-s_r - \min(s_{r-1}, 0) = 0$ and the claim holds trivially. Otherwise, $s_r < 0$, thus $-s_r - \min(s_{r-1}, 0) > 0$, and we continue to assert the correctness of the claim.

In the remaining case, $s_r < 0$, and from Claims 19, 22, and 21 there is an elementary right-folding $B^1$ of $B^0 = B$ with unique centers, such that $s^1_k = s_k$ for every $k \le r - 1$ and $s^1_r = s_r + 1$. Note that $r^1 \le r$, where $s^1_r < 0$ implies that $r^1 = r$. Similarly, it is possible to apply a series of a total of $x = -s_r$ right-foldings $B = B^0, B^1, \ldots, B^x$, where for every $j < x$ we have that $r^j = r$ and $r^x \le r$, and for every $j \le x$ we have that $s^j_k = s_k$ for every $k \le r - 1$ and $s^j_r = s_r + j$. Since each such right-folding decreases the size of the collection by $2^r$ elements, $|B^j| = |B| - 2^r j$, hence the first part of the claim.

After performing $x = -s_r$ right-foldings, we get the collection $B^x$ for which $r^x \le r$, $s^x_k = s_k$ for every $k \le r - 1$, $s^x_r = s_r + x = 0$, and $|B^x| = |B| - 2^r x = |B| + 2^r s_r$. If $-\min(s_{r-1}, 0) = 0$, then the second part of the claim follows immediately. Else, $-\min(s_{r-1}, 0) = -\min(s^x_{r-1}, 0) > 0$, thus $s^x_{r-1} = s_{r-1} < 0$, and $\max(s^x_{r-1}, 0) = 0$. Since $0 = s^x_r = -|L^x_{r-1}| - \frac{|H^x_{r-1}|}{2} + \max(s^x_{r-1}, 0) = -|L^x_{r-1}| - \frac{|H^x_{r-1}|}{2}$, it follows that $L^x_{r-1} = H^x_{r-1} = \emptyset$, and therefore $r^x \le r - 1$. On the other hand, from Claim 9 and the fact that $s^x_{r-1} \ne 0$ we get that $r^x \ge r - 1$, and thus $r^x = r - 1$. As above, it is possible to apply $y = -s^x_{r-1} = -s_{r-1} = -\min(s_{r-1}, 0)$ additional consecutive right-foldings, where each such folding maintains the signature values at positions $0$ to $r - 2$, increases by 1 the signature value at position $r - 1$ with respect to the preceding collection in the series, and decreases the collection size by $2^{r-1}$. Hence, for $-s_r < j \le -s_r - \min(s_{r-1}, 0)$, we have that $r^j \le r - 1$, $|B^j| = |B^x| - 2^{r-1}(j - x) = |B| + 2^r s_r - 2^{r-1}(j + s_r) = |B| + 2^{r-1}(s_r - j)$, $s^j_k = s_k$ for every $k \le r - 2$, and $s^j_{r-1} = s_{r-1} + j - x = s_{r-1} + j + s_r$, as required. □

2.3.2. Correctness of the FOLD Procedure. Throughout this section, $B$ and $n$ correspond to an $l$-block collection and an integer given as an input to the FOLD procedure. When $n \ne |B|$, denote $d' = d_{|B|-n}$.

Claim 24. If there is a folding $B'$ of $B$ of size $n \ne |B|$, then there is an integer $0 \le d \le d'$ such that $s'_i = s_i$ for every $i \le d - 1$, $s'_d \ge s_d + 1$, and $n \ge 2^d \max(s_d + 1, 0) + \Delta_d$. In addition, if $d < d'$, then $s'_d \ge s_d + 2$.

Proof. Assume there is a folding $B'$ of $B$ of size $n \ne |B|$, and let $d \ge 0$ be the integer such that $s_i = s'_i$ for every $i \le d - 1$ and $s'_d \ge s_d + 1$ (whose existence is implied by Claim 12). Since the two signatures agree up to position $d - 1$, it follows that $\Delta_d = \Delta'_d$. Thus,
\[ n = |B'| \overset{\text{Clm. 10}}{=} 2^d\left(|B'_d| + |L'_d| - s'_d\right) + \Delta'_d \ge 2^d \max(s'_d, 0) + \Delta'_d \ge 2^d \max(s_d + 1, 0) + \Delta_d. \]
In addition, $|B| = 2^d(|B_d| + |L_d| - s_d) + \Delta_d$, therefore $|B| - n = 2^d(|B_d| + |L_d| - s_d - |B'_d| - |L'_d| + s'_d)$. Since all parameters in the right-hand side of the latter equation are integers, $|B| - n$ is divisible by $2^d$, and in particular $d \le d'$. Furthermore, if $d < d'$, then $\frac{|B| - n}{2^d}$ is even, and therefore $|B_d| + |L_d| - s_d - |B'_d| - |L'_d| + s'_d$ is also even. As $|B_d| + |L_d|$ is even, as well as $|B'_d| + |L'_d|$, it follows that $s'_d - s_d$ has to be even. Since $s'_d > s_d$, it follows that $s'_d \ge s_d + 2$. □
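The condition tested in line 4 of FOLD, and the choice of $d$ in line 5, can be sketched as below. Two definitions are assumptions here because they are not restated in this section: $d_m$ is taken to be the largest $d$ such that $2^d$ divides $m$, and $\Delta_d$ is taken to be $\sum_{i<d} 2^i |s_i|$, which matches the way $\Delta$ is updated in the proof of Claim 25. Signature entries past the end of the list are treated as 0. The function names are hypothetical.

```python
def largest_power_of_two_dividing(m: int) -> int:
    """Assumed meaning of d_m: the largest d with 2^d dividing m (m != 0)."""
    m, d = abs(m), 0
    while m % 2 == 0:
        m //= 2
        d += 1
    return d


def max_feasible_depth(s, size_b: int, n: int):
    """Largest d in [0, d_{|B|-n}] with n >= 2^d * max(s_d + 1, 0) + Delta_d, else None."""
    def s_at(i):
        return s[i] if i < len(s) else 0

    def delta(d):
        # Assumption: Delta_d = sum over i < d of 2^i * |s_i|.
        return sum(2 ** i * abs(s_at(i)) for i in range(d))

    d_max = largest_power_of_two_dividing(size_b - n)
    for d in range(d_max, -1, -1):
        if n >= 2 ** d * max(s_at(d) + 1, 0) + delta(d):
            return d
    return None  # FOLD would return "FAILED" in this case
```

For instance, with the made-up signature `[1, 2, 0, -1]` of a collection of size 13, a target size of 5 gives $d = 3$, while a target size of 1 admits no feasible $d$.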

Claim 25. If there is a folding of $B$ of size $n$ then FOLD($B$, $n$) returns such a folding $B'$, and otherwise FOLD($B$, $n$) returns "FAILED". In addition, if $n \ne |B|$ and FOLD($B$, $n$) has returned $B'$, then for the maximum integer $0 \le d \le d'$ for which $n \ge 2^d \max(s_d + 1, 0) + \Delta_d$ (whose existence is guaranteed by Claim 24), $s'_i = s_i$ for every $i \le d - 1$ and $r' \le d + 1$, and if $r' = d + 1$ then $s'_d = s_d + 1$ in case $d = d'$ and $s'_d = s_d + 2$ in case $d < d'$.

Proof. When there is no folding of $B$ of size $n$, then in particular $n \ne |B|$, and the procedure does not halt at line 1. In addition, from Claim 24, the condition in line 4 is not met, and the procedure returns "FAILED" in line 10 as required.

Else, there is a folding of $B$ of size $n$, and we show that the procedure finds such a folding sustaining the stated requirements. When $|B| \le n$, the FOLD procedure halts by returning $B + (n - |B|)\varepsilon$ in line 1, which is in particular a folding of $B$ of size $n$ as required. In addition, if $|B| < n$, we have from Claim 18 that $s'_i = s_i$ for every $i \le d - 1$, $s'_d = s_d + 1$, and $r' = d + 1$, thus the remaining requirements in the claim hold. Otherwise, $n < |B|$, and from Claim 24 the condition in line 4 holds, therefore in line 5 of the FOLD procedure, the value of the parameter $d$ is selected to be the maximum integer in the range $0 \le d \le d'$ such that $n \ge 2^d \max(s_d + 1, 0) + \Delta_d$.

Let $B^0 = B + 2^d\varepsilon$ be the value of the collection $B'$ after executing line 5. Thus $|B^0| = |B| + 2^d$, and from Claim 18, we have that

(1) $s^0_i = s_i$ for every $i \le d - 1$,

(2) $s^0_d = s_d + 1$,

(3) $r^0 = d + 1$.

From the proof of Claim 18 and the fact that $B$ is an $l$-block collection it can be seen that $B^0$ has unique centers. From Conclusion 1,
\[ s^0_{d+1} = \frac{\Delta^0_{d+1} - |B^0|}{2^{d+1}} = \frac{\Delta_d + 2^d\,\mathrm{abs}(s_d + 1) - |B| - 2^d}{2^{d+1}}. \]
From Claim 23, the collection $B^0$ can undergo a series of $i$ right-foldings producing the sequence $B^0, B^1, \ldots, B^i$, where $i = -s^0_{d+1} - \min(s^0_d, 0)$. The size of the collection $B^i$ according to Claim 23 is
\[
\begin{aligned}
|B^i| &= |B^0| + 2^d(s^0_{d+1} - i) = (|B| + 2^d) + 2^d\left(2s^0_{d+1} + \min(s^0_d, 0)\right) \\
&= |B| + 2^d + 2^d\left(\frac{\Delta_d + 2^d\,\mathrm{abs}(s_d + 1) - |B| - 2^d}{2^d} + \min(s_d + 1, 0)\right) \\
&= \Delta_d + 2^d\left(\mathrm{abs}(s_d + 1) + \min(s_d + 1, 0)\right) = \Delta_d + 2^d \max(s_d + 1, 0).
\end{aligned}
\]
From the condition in line 4, $n \ge 2^d \max(s_d + 1, 0) + \Delta_d = |B^i|$, and in particular there exists some $0 \le j \le i$ such that $|B^j| \le n$. The sequence of right-foldings computed by the loop in lines 6-7 is a prefix of such a right-folding sequence (i.e. after $x$ iterations of the loop, $B' = B^x$), where the loop terminates after $j$ iterations for a minimal integer $j$ such that $|B^j| \le n$. After executing line 8, $B'$ is a folding of $B$ of size $|B'| = |B^j| + (n - |B^j|) = n$, and so the output $B'$ of the procedure is a folding of $B$ of size $n$, as required.

To complete the proof, we need to show that when $n < |B|$, $s'_i = s_i$ for every $i \le d - 1$ and $r' \le d + 1$, and if $r' = d + 1$ then $s'_d = s_d + 1$ in case $d = d'$ and $s'_d = s_d + 2$ in case $d < d'$. To do so, we consider two cases for the number of loop iterations $j$ conducted by the procedure. Note that $j > 0$, since in the first iteration we have that $|B^0| = |B| + 2^d > n$.

1. $0 < j \le -s^0_{d+1}$. In this case, Claim 23 and the loop termination condition imply that $n \ge |B^j| = |B^0| - 2^{d+1}j = |B| + 2^d - 2^{d+1}j = |B| + 2^d(1 - 2j)$, and that $n < |B^{j-1}| = |B| + 2^d(1 - 2(j - 1))$, therefore $2j - 3 < \frac{|B| - n}{2^d} \le 2j - 1$. Note that when $d = d'$, $\frac{|B| - n}{2^d}$ is odd, hence $\frac{|B| - n}{2^d} = 2j - 1$, whereas when $d < d'$, $\frac{|B| - n}{2^d}$ is even, and $\frac{|B| - n}{2^d} = 2j - 2$.

In the case where $d = d'$, $|B^j| = |B| + 2^d(1 - 2j) = |B| - 2^d\left(\frac{|B| - n}{2^d}\right) = n$, thus no $\varepsilon$ elements are added to the collection in line 8 of the procedure and the returned collection is $B' = B^j$. From Claim 23, $r' = r^0 = d + 1$, $s'_i = s^0_i = s_i$ for every $i \le d - 1$, and $s'_d = s^0_d = s_d + 1$, and the claim follows.

In the case where $d < d'$, $|B^j| = |B| + 2^d(1 - 2j) = |B| - 2^d(2j - 2 + 1) = |B| - 2^d\left(\frac{|B| - n}{2^d} + 1\right) = n - 2^d$, thus after line 8 of the procedure the returned collection is $B' = B^j + 2^d\varepsilon$. It may be asserted that $\varepsilon$ is the unique minimal element in $B^0_d$ (as all other elements are $l$-blocks with higher top values), and thus this element participates in the right-folding that transforms $B^0$ to $B^1$. Therefore, for each $1 \le j' \le j$, $\varepsilon \notin B^{j'}$, and in particular $B'$ is a type II elementary folding of $B^j$. From Claim 18, $r' = d + 1$, $s'_i = s^j_i = s_i$ for every $i \le d - 1$, and $s'_d = s^j_d + 1 = s_d + 2$, hence the claim follows.

2. $-s^0_{d+1} < j \le -s^0_{d+1} - \min(s^0_d, 0)$. In this case, Claim 23 and the loop termination condition imply that $n \ge |B^j| = |B^0| + 2^d(s^0_{d+1} - j) = (|B| + 2^d) + 2^d(s^0_{d+1} - j) = |B| + 2^d(s^0_{d+1} - j + 1)$, and $n < |B^{j-1}| = |B| + 2^d(s^0_{d+1} - j + 2)$. Therefore, $-s^0_{d+1} + j - 2 < \frac{|B| - n}{2^d} \le -s^0_{d+1} + j - 1$. Since $\frac{|B| - n}{2^d}$ is an integer, it follows that $\frac{|B| - n}{2^d} = -s^0_{d+1} + j - 1$, therefore $|B^j| = n$, and consequently after line 8 of the procedure the returned collection is $B' = B^j$. From Claim 23, $r' \le d$, and $s'_i = s^0_i = s_i$ for every $i \le d - 1$, and the claim follows. □

Finally, we now prove the correctness of the FOLD procedure, as formulated by Claim 26. 

Claim 26. Let B be an l-block collection and let n > be an integer. FOLD(B, n) returns a folding 
B' of B of size n if such a folding exists, and otherwise it returns "FAILED". In addition, for every 
l-block collection B* such that \B\ = \B*\ and s(B) < s(B*), if there is a folding B'* of B* of size n, 
then FOLD(B,n) returns a collection B' such that s(B') < s(B'*). 

Proof. Claim 25 proves the first statement in Claim 26, thus it remains to show that for every $l$-block
collection $B^*$ such that $|B| = |B^*|$ and $\vec{s} \le \vec{s}^*$, if there is a folding $B'^*$ of $B^*$ of size $n$, then FOLD($B$, $n$)
returns a collection $B'$ such that $\vec{s}' \le \vec{s}'^*$.

First, note that when $n = |B| = |B^*|$, then in particular $B^*$ and $B$ are minimum signature $n$-size
foldings of $B^*$ and $B$, respectively (Claim 12), and thus $\vec{s}(B) \le \vec{s}(B'^*)$ for every $n$-size folding $B'^*$ of $B^*$.
Since in this case FOLD($B$, $n$) returns $B$, the claim follows. Otherwise, $n \ne |B|$, and we first show
that FOLD($B$, $n$) returns a folding $B'$ of $B$ of size $n$ if such a folding exists, and otherwise it returns
"FAILED".

In the remainder of this proof we assume that $n \ne |B| = |B^*|$, and note that $d' = d_{|B| - n} = d_{|B^*| - n}$.
Since $\vec{s} \le \vec{s}^*$, either $\vec{s} = \vec{s}^*$, or $\vec{s} < \vec{s}^*$ and there is an integer $i$ such that $\vec{s}_{i-1} = \vec{s}^*_{i-1}$ and $s_i < s^*_i$.

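We first show that if there is a folding $B'^*$ of $B^*$ of size $n$, FOLD($B$, $n$) returns a folding $B'$ of $B$ of size
$n$ satisfying $\vec{s}' \le \vec{s}'^*$. In this case, Claim 24 states that there is an integer $0 \le d^* \le d'$ such that
$\vec{s}'^*_{d^*-1} = \vec{s}^*_{d^*-1}$, $s'^*_{d^*} \ge s^*_{d^*} + 1$, and $n \ge 2^{d^*}\max(s^*_{d^*} + 1, 0) + \Delta^*_{d^*}$. Consider two cases: (1) $\vec{s}_{d^*-1} = \vec{s}^*_{d^*-1}$,
which occurs when $\vec{s} = \vec{s}^*$ or when $\vec{s} < \vec{s}^*$ and $i \ge d^*$, and (2) $\vec{s}_{d^*-1} < \vec{s}^*_{d^*-1}$, which occurs when $\vec{s} < \vec{s}^*$
and $i < d^*$.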

(1) $\vec{s}_{d^*-1} = \vec{s}^*_{d^*-1}$. In this case, $n \ge 2^{d^*}\max(s^*_{d^*} + 1, 0) + \Delta^*_{d^*} \ge 2^{d^*}\max(s_{d^*} + 1, 0) + \Delta_{d^*}$. Thus, when
executing FOLD($B$, $n$), the condition in line 4 is met and the algorithm does not return "FAILED".
From Claim 25, FOLD($B$, $n$) returns an $n$-size folding $B'$ of $B$, such that for the maximum integer
$0 \le d \le d'$ for which $n \ge 2^d\max(s_d + 1, 0) + \Delta_d$ we have that $\vec{s}'_{d-1} = \vec{s}_{d-1}$ and $r' \le d + 1$, and if
$r' = d + 1$ then $s'_d = s_d + 1$ in case $d = d'$ and $s'_d = s_d + 2$ in case $d < d'$. By selection, $d^* \le d \le d'$.
If $d^* < d$, then $s'_{d^*} = s_{d^*} \le s^*_{d^*} < s'^*_{d^*}$, and in particular $\vec{s}' < \vec{s}'^*$ and the claim follows. If $d^* = d$,
then $\vec{s}'_{d-1} = \vec{s}_{d-1} = \vec{s}^*_{d-1} = \vec{s}'^*_{d-1}$. If $r' < d + 1$ then $\vec{s}' \le \vec{s}'^*$ from Claim 11, and the claim follows. If
$r' = d + 1$, then $s'_d = s_d + 1$ in case $d = d'$ and $s'_d = s_d + 2$ in case $d < d'$. In addition, from Claim 24,
$s'^*_d \ge s_d + 1$ in case $d = d'$ and $s'^*_d \ge s_d + 2$ in case $d < d'$, thus in both cases $s'_d \le s'^*_d$. If $s'_d < s'^*_d$ then
$\vec{s}'_d < \vec{s}'^*_d$, and in particular $\vec{s}' < \vec{s}'^*$ and the claim follows. If $s'_d = s'^*_d$ then $\vec{s}'_d = \vec{s}'^*_d$, and from Claim 11
$\vec{s}' \le \vec{s}'^*$ and the claim follows.

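(2) $\vec{s}_{d^*-1} < \vec{s}^*_{d^*-1}$. In this case, for $i < d^*$ we have that $\vec{s}_{i-1} = \vec{s}^*_{i-1}$ and $s_i < s^*_i$. Now, $n \ge
2^{d^*}\max(s^*_{d^*} + 1, 0) + \Delta^*_{d^*} \ge \Delta^*_{d^*} \ge \Delta^*_{i+1} = 2^i\,\mathrm{abs}(s^*_i) + \Delta^*_i \ge 2^i\max(s^*_i, 0) + \Delta^*_i$. Similarly as before,
Claims 11 and 25 can be applied to show that $\vec{s}' \le \vec{s}'^*$.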

□ 

2.4. Correctness of Algorithm SEARCH-BFB. Assuming there is a BFB string $\alpha^*$ admitting
the algorithm's input count vector $\vec{n}$, the BFB palindrome $\beta^* = \alpha^*\bar{\alpha}^*$ admits the count vector $2\vec{n}$.
Let $B^{*k+1} = \emptyset, B^{*k}, B^{*k-1}, \ldots, B^{*1}$ be the block collection series corresponding to the layers of $\beta^*$ as
described in the main manuscript. Since $B^{k+1} = B^{*k+1} = \emptyset$ ($B^{k+1}$ is initialized in line 1 of Algorithm
SEARCH-BFB), we have that $\vec{s}(B^{k+1}) = \vec{s}(B^{*k+1})$. Assume that for some $0 < l \le k$ we have that
$\vec{s}(B^{l+1}) \le \vec{s}(B^{*l+1})$. Recall that the collection $B^{*l}$ is obtained by the wrapping of some folding $B'^*$ of
size $n_l$ of $B^{*l+1}$. Since the wrapping operation does not change element multiplicities and top values,
it follows that $\vec{s}(B^{*l}) = \vec{s}(B'^*)$. From Claim 26, the application of the FOLD procedure in the $l$-th
iteration of the loop in lines 2-4 of the algorithm returns a folding $B'$ of $B^{l+1}$ of size $n_l$, where
$\vec{s}(B^l) = \vec{s}(B') \le \vec{s}(B'^*) = \vec{s}(B^{*l})$. Inductively, the algorithm does not return "FAILED" in any
of the loop iterations, and after the last iteration $\vec{s}(B^1) \le \vec{s}(B^{*1})$. From the same arguments as above
and since $B^{*1}$ can be folded into the single palindrome $\beta^*$, it follows that the application of FOLD in
line 4 of the algorithm does not fail, and returns a collection containing a single palindrome $\beta = \alpha\bar{\alpha}$,
where $\alpha$ is a BFB string admitting $\vec{n}(\alpha) = \vec{n}$.

For the other direction of the proof, assume that the BFB algorithm returned the string $\alpha$. In this
case, the series of collections $B^{k+1}, B^k, \ldots, B^1$ satisfies that each collection $B^l$ is an $l$-block collection
of size $n_l$ and is obtained by folding and wrapping of the preceding collection in the series, $B^{l+1}$. The
final collection $B^1$ is folded into a single BFB palindrome $\beta = \alpha\bar{\alpha}$ admitting the count vector $2\vec{n}$, and
therefore $\alpha$ is a BFB string admitting $\vec{n}$.

2.5. Time Complexity of Algorithm SEARCH-BFB.

2.5.1. Object Representation. The algorithm handles two types of objects: BFB palindromes, and BFB
palindrome collections. BFB palindromes are further divided into three subtypes, which are implemented
separately: empty palindromes, $l$-blocks, and composite $l$-BFB palindromes of the form $\beta\gamma\beta$ (see Claim
1 in the main paper). Each BFB palindrome object contains a field maintaining the top value of the
represented palindrome, allowing $O(1)$ time queries of this value. An empty palindrome is represented
by an object containing only the top value field (which always holds the value 0), and generating a new
such object takes $O(1)$ time. An $l$-block is implemented as an object containing, in addition to the top
value field, a pointer to its internal $(l+1)$-BFB palindrome. Given a pointer to the internal $(l+1)$-BFB
palindrome, generating a new $l$-block object takes $O(1)$ time by copying the pointer, and setting the top
value field to the top value of the pointed $(l+1)$-BFB palindrome. A composite $l$-BFB palindrome $\beta\gamma\beta$
is implemented by specifying a pointer to the $l$-BFB palindrome $\beta$, and a list of $l$-BFB palindromes
$\alpha_1, \alpha_2, \ldots, \alpha_p$ representing the convexed $l$-collection $A$ such that $\gamma = \gamma_A$. Composite palindromes can
be generated in time proportional to the order of their internal convexed $l$-collection (where the top
value field is set to be the top value of $\beta$).

A collection $B = \{m_1\beta_1, m_2\beta_2, \ldots, m_q\beta_q\}$ is implemented by an object containing a field which
maintains the size $|B|$ of the collection, and two doubly linked lists maintaining the prefixes $\vec{L}_{r-1}$
and $\vec{H}_{r-1}$ of the series $L$ and $H$ in the decomposition of $B$, where $r = r(B)$. Note that for $i > r$,
$L_i = H_i = \emptyset$. Each element $L_i$ or $H_i$ is implemented as a linked list of $l$-BFB palindromes ordered with
nondecreasing top values (it is possible that an $H_i$ list contains multiple repeats of identical elements).
Thus, computing $\min_t(L_i)$ or $\min_t(H_i)$ and extracting minimal elements from such lists is done in $O(1)$
time. Generating an empty collection is done in $O(1)$ time (where the two lists $\vec{L}_{r-1}$ and $\vec{H}_{r-1}$ are
empty), and duplicating or wrapping a collection $B$ takes at most $O(|B|)$ time (note that $r - 1 \le \log |B|$,
since an element $\beta \in B_{r-1}$ corresponds to $2^{r-1}$ repeats of $\beta$ in $B$, and that the total number of elements
in all lists $L_i$ and $H_i$ is at most $|B|$).

2.5.2. Type II Elementary Folding. Using the object representation described above, for a collection $B$
such that $\varepsilon \notin B$ and an integer $m > 0$, it is possible to compute a type II elementary folding $B' = B + m\varepsilon$
in $O(|B| + m)$ time as follows. First, the number $d = d_m$ is computed. Note that $d \le \log m$ ($d$ can be
defined as the index of the least significant bit different from 0 in the binary representation of $m$), and
may be computed in $O(\log m)$ time. $B'$ is initialized by copying $B$, i.e. generating the lists $\vec{L}'_{r-1}$ and
$\vec{H}'_{r-1}$ (in $O(|B|)$ time). Then, if $d \ge r$, empty collections $L'_i$ and $H'_i$ are added to the prefixes of $L'$ and
$H'$ for $r \le i \le d$, and a single $\varepsilon$ element is added to $L'_d$. Else, $d < r$, and a single $\varepsilon$ element is added
as the first element in $L'_d$ (being of minimum top value among all elements in the list), and elements
from collections $L'_i$ and $H'_i$ for $i > d$ are moved into $H'_d$. This latter modification is performed by first
merging each pair of lists $L'_i$ and $H'_i$ for $i > d$ into a single list ordered with nondecreasing top values (in
linear time with respect to the number of elements in the two lists), and then, with increasing index $i$, each
merged list is added to the beginning of $H'_d$, where $2^{i-d}$ repeats of each element in the merged list of $L'_i$
and $H'_i$ are added to $H'_d$. In both cases, $d \ge r$ or $d < r$, it is possible to assert that the modification
properly updates the representation of $B'$ to represent the collection $B + 2^d\varepsilon$, that $r(B') = d + 1$, and
that the total time required for the modification is at most $O(|B| + d) = O(|B| + \log m)$. Finally, an additional
$\frac{m}{2^d} - 1$ repeats of $\varepsilon$ are added to $H'_d$ in $O(m)$ time, where now it is possible to assert that $B'$ properly
represents the collection $B + m\varepsilon$, and that the total computation time is $O(|B| + m)$.
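
The two integer quantities used in this step can be sketched in Python. This is only an illustration of the arithmetic, not of the list manipulation: it assumes $d_m$ is the index of the least significant set bit of $m$, as stated above, and the function names are hypothetical.

```python
def lowest_set_bit_index(m):
    """Return d_m: the index of the least significant 1 bit of m (m > 0)."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    # m & -m isolates the lowest set bit; bit_length() - 1 gives its index.
    return (m & -m).bit_length() - 1


def split_empty_elements(m):
    """Return (d, extra): one epsilon element is placed at level d, and
    `extra` = m / 2^d - 1 further repeats are added at the same level."""
    d = lowest_set_bit_index(m)
    extra = m // (1 << d) - 1
    return d, extra


# Example: m = 12 = 0b1100, so d = 2 and 12 / 4 - 1 = 2 extra repeats.
d, extra = split_empty_elements(12)
# Each level-d element stands for 2^d copies, so the total is m again.
total = (1 << d) * (1 + extra)
```

Since every element at level $d$ stands for $2^d$ repeats, the single element plus the extra repeats account for exactly $m$ copies of $\varepsilon$.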

2.5.3. Right-folding. In order to right-fold a collection $B$, the algorithm first gets pointers to the elements
$L_{r-1}$ and $H_{r-1}$, in $O(r)$ time for $r = r(B)$. Then, it starts traversing these lists backward for $i = r - 1$
down to $g$, inclusive, where $g$ is the first encountered index such that $H_g \ne \emptyset$. For each such $i$, the
algorithm extracts the first (minimal) element in the list $L_i$, and accumulates these elements in a list
$A$. Finally, the algorithm extracts two copies of the minimal element $\beta$ in $H_g$, and uses $\beta$ and $A$ to
construct the BFB palindrome $\alpha = \beta\gamma_A\beta$. Then, $\alpha$ is inserted into $L_g$. As this procedure takes $O(r)$
time and decreases the size of the collection by $2^r$, any valid consecutive application of right-foldings
over $B$ takes at most $O(|B|)$ time.



2.5.4. The FOLD Procedure. Consider the application of the FOLD procedure on a collection $B =
\{m_1\beta_1, m_2\beta_2, \ldots, m_q\beta_q\}$ and an integer $n > 0$. If $n \ge |B|$, the procedure applies in line 1 a type II
elementary folding in $O(|B| + n)$ time (Section 2.5.2) and halts. Otherwise, given the series $\vec{L}_{r-1}$ and
$\vec{H}_{r-1}$, it is possible to compute $\vec{s}_r$ and $\vec{\Delta}_{r+1}$ in $O(r) = O(\log |B|)$ time. Note that $s_i = 0$ for $i > r$,
and $\Delta_i = \Delta_{r+1}$ for $i > r + 1$. The number $d_{|B| - n}$ satisfies $d_{|B| - n} \le \max(\log |B|, \log n)$. After
computing $\vec{s}_r$ and $\vec{\Delta}_{r+1}$, checking the condition in line 4, as well as computing the parameter $d$ in line 5,
can be done in $O(d_{|B| - n})$ time. The two type II elementary foldings in lines 5 and 8 take $O(|B| + n)$
time (Section 2.5.2), and the total time for right-folding applications in the loop in lines 6-7 is $O(|B|)$
(Section 2.5.3). Thus, the total running time of the procedure is $O(|B| + n)$.



2.5.5. Overall Running Time. Let $\vec{n} = [n_1, n_2, \ldots, n_k]$ be the input vector for the algorithm. Denote
$N = \sum_{1 \le l \le k} n_l$, and note that $N$ is the length of the output string $\alpha$ in case the algorithm does not
return "FAILED". It is simple to assert that besides operations conducted within the FOLD procedure,
Algorithm SEARCH-BFB performs $O(N)$ operations. For every $1 \le l \le k$, FOLD is called once by the
BFB algorithm over the collection $B^{l+1}$ of size $n_{l+1}$ and the integer $n_l$, and runs in $O(n_{l+1} + n_l)$ time
(Section 2.5.4). Summing the running time of FOLD for $l = k$ down to 1, its overall running time, as
well as the overall running time of Algorithm SEARCH-BFB, is $O(N)$.
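
The final summation can be written out explicitly. The following short derivation assumes the convention $n_{k+1} = 0$ (the initial collection $B^{k+1}$ is empty) and writes $c$ for the constant hidden in the $O(\cdot)$ bound of Section 2.5.4; both are notational assumptions made here for illustration.

```latex
\sum_{l=1}^{k} c\,(n_{l+1} + n_l)
  \;=\; c \sum_{l=2}^{k+1} n_l \;+\; c \sum_{l=1}^{k} n_l
  \;\le\; 2c \sum_{l=1}^{k} n_l
  \;=\; 2cN \;=\; O(N).
```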



3. The Decision Variant 

In this section, we describe a simplification of the SEARCH-BFB algorithm which solves the decision
variant of the BFB count vector problem. Essentially, this algorithm applies similar steps to those of
the search algorithm, yet instead of explicitly maintaining collections $B$, the algorithm only maintains
the signature $\vec{s}$ of $B$. We assume that the algorithm explicitly maintains only the prefix $\vec{s}_r$ of $\vec{s}$ (for
$r = r(B)$) as a linked list, and that for $i > r$ the algorithm uses the value 0 whenever it needs the
value $s_i$.



Algorithm: DECISION-BFB($\vec{n}$)

Input: A count vector $\vec{n} = [n_1, n_2, \ldots, n_k]$.

Output: "TRUE" if $\vec{n}$ is a BFB count vector, and "FALSE" otherwise.

1 Set $n_{k+1} \leftarrow 0$ and $\vec{s}^{k+1} \leftarrow 0$.
2 For $l \leftarrow k$ down to 1 do
3     Apply SIGNATURE-FOLD($\vec{s}^{l+1}$, $n_{l+1}$, $n_l$). If this operation has failed, return "FALSE".
4     Otherwise, set $\vec{s}^l$ to be the value returned from the call to SIGNATURE-FOLD($\vec{s}^{l+1}$, $n_{l+1}$, $n_l$).
5 Apply SIGNATURE-FOLD($\vec{s}^1$, $n_1$, 1). If this operation has failed, return "FALSE", and otherwise return
  "TRUE".

Procedure: SIGNATURE-FOLD($\vec{s}$, $n_B$, $n$)

Input: The signature $\vec{s}$ and size $n_B$ of an $l$-block collection $B$ and an integer $n > 0$.

Output: The minimum signature $\vec{s}'$ of a folding $B'$ of $B$ such that $|B'| = n$, or the string "FAILED" if there is no
such $B'$.

1 If $n_B \le n$ then
2     return ADD-EMPTY($\vec{s}$, $n_B$, $n - n_B$).
3 Else
4     Compute the prefix $\vec{\Delta}_{d_{n_B - n}}$ of $\Delta(B)$.
5     If there exists $0 \le d \le d_{n_B - n}$ such that $n \ge 2^d\max(s_d + 1, 0) + \Delta_d$ then
6         Let $d$ be the maximum integer sustaining the condition above.
7         Set $\vec{s}' \leftarrow$ ADD-EMPTY($\vec{s}$, $n_B$, $2^d$), and $n_{B'} \leftarrow n_B + 2^d$.
8         If $n \ge n_{B'} + 2^{d+1}s_{d+1}$ then
9             Set $s'_{d+1} \leftarrow s'_{d+1} +$ […].
10            Set $n_{B'} \leftarrow \Delta_d + 2^d\,\mathrm{abs}(s_d) + 2^{d+1}\,\mathrm{abs}(s'_{d+1})$.
11            Set $\vec{s}' \leftarrow$ ADD-EMPTY($\vec{s}'$, $n_{B'}$, $n - n_{B'}$).
12        Else
13            Set $s'_d \leftarrow 0$.
14            Set $s'_{d+1} \leftarrow$ […]$/2^d + 2s_{d+1}$.
15        return $\vec{s}'$.
16    Else return "FAILED"
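
The test in lines 5-6 of the procedure can be sketched in Python. This is an illustrative sketch only, under stated assumptions: $\Delta_d = \sum_{0 \le i < d} 2^i\,\mathrm{abs}(s_i)$ (as computed in Procedure ADD-EMPTY), signature entries beyond the stored prefix are 0, and the comparison in line 5 is non-strict. The function names and the example signatures are hypothetical and are not claimed to arise from an actual collection.

```python
def lowest_set_bit_index(m):
    """d_m: index of the least significant 1 bit of a positive integer m."""
    return (m & -m).bit_length() - 1


def delta(sig, d):
    """Delta_d = sum over 0 <= i < d of 2^i * abs(s_i); missing entries are 0."""
    return sum((1 << i) * abs(sig[i]) for i in range(min(d, len(sig))))


def max_fold_level(sig, n_b, n):
    """Largest d in [0, d_{n_b - n}] with
    n >= 2^d * max(s_d + 1, 0) + Delta_d, or None if no such d exists."""
    if not 0 < n < n_b:
        raise ValueError("expects 0 < n < n_b")
    best = None
    for d in range(lowest_set_bit_index(n_b - n) + 1):
        s_d = sig[d] if d < len(sig) else 0
        if n >= (1 << d) * max(s_d + 1, 0) + delta(sig, d):
            best = d  # keep the maximum level that satisfies the test
    return best


sig = [-2, 1]                    # hypothetical signature
size = delta(sig, len(sig))      # 1*2 + 2*1 = 4
level = max_fold_level(sig, size, 2)
```

When `max_fold_level` returns `None`, the corresponding branch of the procedure is the one that returns "FAILED".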



Procedure: ADD-EMPTY($\vec{s}$, $n_B$, $m$)

Input: The signature $\vec{s}$ and size $n_B$ of an $l$-BFB palindrome collection $B$ containing no $\varepsilon$ elements, and an integer
$m \ge 0$.

Output: The signature $\vec{s}'$ of the folding $B' = B + m\varepsilon$ of $B$.

1 If $m = 0$ then
2     return $\vec{s}$
3 Else
4     Let $d = d_m$, and set the prefix $\vec{s}'_{d-1}$ to be a copy of the prefix $\vec{s}_{d-1}$ of $\vec{s}$.
5     Set $s'_d \leftarrow s_d + 1$.
6     Compute $\Delta_d = \sum_{0 \le i < d} 2^i\,\mathrm{abs}(s_i)$.
7     Set $s'_{d+1} \leftarrow$ […]. // All values $s'_i$ for $i > d + 1$ are implicitly set to 0.
8     return $\vec{s}'$.



The fact that the signature modifications applied by Procedure SIGNATURE-FOLD yield identical
signatures to those of the collections computed by Procedure FOLD can be asserted from Conclusion 1
and Claims 18 and 23. It may also be asserted that the total number of operations in all
calls to Procedure ADD-EMPTY (lines 2, 7, and 11 in Procedure SIGNATURE-FOLD), as well as
the computation of $\vec{\Delta}_{d_{n_B - n}}$ in line 4, checking the condition in line 5, and computing $d$ in line 6, is
$O(r(B) + r(B')) = O(\log n_B + \log n)$. Besides these operations, Procedure SIGNATURE-FOLD applies
an additional $O(1)$ operations, hence its total running time is $O(\log n_B + \log n)$. Therefore, the overall
running time of Algorithm DECISION-BFB is

$$O\left(\sum_{0 \le l \le k} (\log n_{l+1} + \log n_l)\right) = O\left(\sum_{0 \le l \le k} \log n_l\right) = O(N),$$

where $N$ is the number of bits in the representation of the input vector $\vec{n}$. A more involved amortized
analysis, omitted from this text, may show that the algorithm performs $O(N)$ bit operations, hence
being strictly linear with respect to its input length.

4. The Distance Variant 

This section gives Algorithm DISTANCE-BFB for solving the distance variant of the BFB count
vector problem. In fact, the presented algorithm solves the problem for every suffix $\vec{n}^l =
[n_l, n_{l+1}, \ldots, n_k]$ of the input vector $\vec{n} = [n_1, n_2, \ldots, n_k]$.

For a vector $\vec{n} = [n_1, n_2, \ldots, n_k]$ of length $k$ and an integer $m$, denote by $[m, \vec{n}]$ the $(k+1)$-length
vector $[m, n_1, n_2, \ldots, n_k]$. The algorithm is generic and may work with any vector distance measure
$\delta$, provided that for any three equal-length vectors $\vec{n}$, $\vec{n}'$, and $\vec{n}''$ such that $\delta(\vec{n}, \vec{n}') \le \delta(\vec{n}, \vec{n}'')$, (1)
$\delta(\vec{n}', \vec{n}') = \delta(\vec{n}'', \vec{n}'') = 0 \le \delta(\vec{n}, \vec{n}') \le \delta(\vec{n}, \vec{n}'') \le 1$, and (2) for any pair of numbers $m$ and $m'$,
$\delta([m, \vec{n}], [m', \vec{n}']) \le \delta([m, \vec{n}], [m', \vec{n}''])$. For some precision parameter $0 < \eta \le 1$, the algorithm finds the
exact solution for the distance variant of the BFB count vector problem for every suffix of the input
vector for which the solution is at most $\eta$, and returns the approximated solution 1 for suffixes for which
the solution is greater than $\eta$.

Similarly to Algorithms SEARCH-BFB and DECISION-BFB, Algorithm DISTANCE-BFB runs $k$
iterations on an input vector $\vec{n} = [n_1, n_2, \ldots, n_k]$, indexed from $k$ down to 1. At the end of iteration $l$, the
algorithm computes a collection $S^l$ containing elements of the form $(\vec{n}^i = [n^i_l, n^i_{l+1}, \ldots, n^i_k], \vec{s}^i)$, where $\vec{s}^i$
is the minimum signature of an $l$-block collection $B^i$ admitting the count vector $\vec{n}^i$, and $\delta(\vec{n}^l, \vec{n}^i) \le \eta$.
It is guaranteed that for every BFB count vector $\vec{n}^j = [n^j_l, n^j_{l+1}, \ldots, n^j_k]$ such that $\delta(\vec{n}^l, \vec{n}^j) \le \eta$ and
every $l$-block collection with signature $\vec{s}^j$ admitting $\vec{n}^j$, $S^l$ contains a pair $(\vec{n}^i, \vec{s}^i)$ such that $\delta(\vec{n}^l, \vec{n}^i) \le \delta(\vec{n}^l, \vec{n}^j)$ and
$\vec{s}^i \le \vec{s}^j$.

Consider the signature $\vec{s}$ of a collection $B$ of size $n$. It is simple to show that $r(B) \le \log n + 1$,
and that $-n \le s_i \le n$ for every $0 \le i \le r$. Therefore, $\vec{s}$ can be represented by $O(\log^2 n)$ bits, and so
the number of different signatures of collections of size $n$ is upper bounded by $2^{O(\log^2 n)}$. In addition,
under realistic assumptions, we may assume that the number of different values $n$ examined in line 6
of Algorithm DISTANCE-BFB is bounded by $2^{O(\log n_l)}$, since this number should approximate the count
$n_l$ (for example, using the Poisson $\delta$ function described in the main manuscript, it is possible to show
that for every value of $n_l$ and $\vec{n}^i$ and for $n \ge 20n_l$, $\delta(\vec{n}^l, [n, \vec{n}^i]) > 1 - 10^{-6}$, thus choosing $\eta = 1 - 10^{-6}$
guarantees that the loop in lines 6-9 is executed fewer than $20n_l$ times for every $(\vec{n}^i, \vec{s}^i) \in S^{l+1}$).
Due to the condition in line 7, every possible signature $\vec{s}$ appears at most once in some pair in $S^l$,
thus the size of $S^l$ is bounded by $2^{O(\log^2 n_l)}$. It is straightforward to observe that the total number of
operations in the loop in lines 7-9 is also $2^{O(\log^2 n_l)}$, and so the total running time of the algorithm is
bounded by $\sum_{1 \le l \le k} 2^{O(\log^2 n_l)} \le 2^{O(\log^2 N)} = N^{O(\log N)}$.
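
The bit bound on signatures follows from a direct count. A short worked version is given below, assuming each entry $s_i$ is stored as a signed integer in a fixed-width field; the field width is an illustrative choice, not taken from the text.

```latex
\underbrace{(r(B) + 1)}_{\text{entries } s_0, \ldots, s_r}
\cdot
\underbrace{\lceil \log_2 (2n + 1) \rceil}_{\text{bits per entry, since } -n \le s_i \le n}
\;\le\; (\log n + 2)\,\lceil \log_2 (2n + 1) \rceil
\;=\; O(\log^2 n),
\qquad\text{hence}\qquad
\#\{\text{signatures}\} \;\le\; 2^{O(\log^2 n)}.
```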

Algorithm: DISTANCE-BFB($\vec{n}$, $\eta$)

Input: A count vector $\vec{n} = [n_1, n_2, \ldots, n_k]$, and a precision parameter $0 < \eta \le 1$.

Output: For every $1 \le l \le k$, the algorithm reports the minimum distance $\delta_l$ of the suffix $\vec{n}^l = [n_l, n_{l+1}, \ldots, n_k]$ of
$\vec{n}$ from a BFB count vector, in case this distance is at most $\eta$.

// Collections of the form $S^l$ contain pairs $(\vec{n}^i = [n^i_l, n^i_{l+1}, \ldots, n^i_k], \vec{s}^i)$, where $\vec{s}^i$ is the minimum
// signature of an $l$-block collection $B^i$ admitting the count vector $\vec{n}^i$, and $\delta(\vec{n}^l, \vec{n}^i) \le \eta$.

1 Set $S^{k+1}$ to be a collection containing only the pair $(0, 0)$.
2 For $l \leftarrow k$ down to 1 do
3     Set $\delta_l \leftarrow 1$.
4     Set $S^l \leftarrow \emptyset$.
5     For each $(\vec{n}^i = [n^i_{l+1}, \ldots, n^i_k], \vec{s}^i) \in S^{l+1}$ do
6         For each $n \ge 1$ such that $\delta(\vec{n}^l, [n, \vec{n}^i]) \le \eta$ do
7             If SIGNATURE-FOLD($\vec{s}^i$, $n^i_{l+1}$, $n$) $= \vec{s}$, IS-PALINDROMIC($\vec{s}$), and for all $(\vec{n}^j, \vec{s}^j) \in S^l$ such that
              $\delta(\vec{n}^l, \vec{n}^j) \le \delta(\vec{n}^l, [n, \vec{n}^i])$ it is true that $\vec{s} < \vec{s}^j$ then
8                 Add $([n, \vec{n}^i], \vec{s})$ to $S^l$.
9                 Set $\delta_l \leftarrow \min(\delta_l, \delta(\vec{n}^l, [n, \vec{n}^i]))$.
10    Report $\delta_l$.

Procedure: IS-PALINDROMIC($\vec{s}$)

Input: The signature $\vec{s}$ of an $l$-BFB palindrome collection $B$.

Output: "TRUE" if it is possible to concatenate all elements in $B$ into a single $l$-BFB palindrome, and "FALSE"
otherwise.

1 Compute the prefix $\vec{\Delta}_{r+1}$ of $\Delta(B)$ for $r = r(B)$. // Note that $|B| = \Delta_{r+1}$.
2 If there exists $0 \le d \le d_{\Delta_{r+1} - 1}$ such that $1 \ge 2^d\max(s_d + 1, 0) + \Delta_d$ then return "TRUE".
3 Else return "FALSE".

5. Chromosome simulation details

Each chromosome pair was modeled as two sequences of 100,000,000 ordered bases. Then fifty
rearrangements were introduced to each chromosome independently. Each rearrangement type was chosen
randomly from deletion, inversion, and duplication, according to a distribution. Thus, both balanced
and unbalanced rearrangements were used to simulate the chromosomes. If the chosen rearrangement
was a duplication, then it was decided whether or not the duplication would be tandem and whether or
not it would be inverted. Tandem duplications would be inserted adjacent to the original chromosome
segment, and inverted duplications would have the new duplicated segment reversed with respect to the
original segment.

Two rearrangement type regimes were used. In the first regime, referred to as "evendup" in the
supplemental data, each rearrangement was a duplication, inversion, or deletion with probability .5, .25,
and .25 respectively. Duplications had a 50% chance of being tandem and, independently, a 50% chance
of being inverted. In the second regime, called "highdup" in the supplemental data, the probabilities of
duplication, inversion, and deletion were […], […], and […]. The probability of a duplication being tandem
or inverted was .9 and .9. This second regime was created because in the first, fold-back inversions occur
infrequently. The second regime allowed us to examine tests for BFB when an alternative mechanism
is creating many fold-back inversions.
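
Drawing rearrangement types under the "evendup" regime can be sketched as follows. This is a minimal sketch, not the simulation code used for the paper: the dictionary layout, the function name, and the fixed random seed are assumptions made for illustration. Only the probabilities stated above for "evendup" are used.

```python
import random

# "evendup" regime: probabilities of each rearrangement type as stated in the text.
EVENDUP = {"duplication": 0.5, "inversion": 0.25, "deletion": 0.25}


def sample_rearrangement(rng, type_probs=EVENDUP, p_tandem=0.5, p_inverted=0.5):
    """Draw one rearrangement; duplications additionally get independent
    'tandem' and 'inverted' flags."""
    kinds = list(type_probs)
    weights = [type_probs[k] for k in kinds]
    kind = rng.choices(kinds, weights=weights)[0]
    event = {"type": kind}
    if kind == "duplication":
        event["tandem"] = rng.random() < p_tandem
        event["inverted"] = rng.random() < p_inverted
    return event


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
events = [sample_rearrangement(rng) for _ in range(50)]  # fifty per chromosome
```

The second regime would be obtained by passing a different `type_probs` mapping together with `p_tandem=0.9` and `p_inverted=0.9`.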

The size of each non-BFB rearrangement was chosen from a normal distribution bounded at zero
with mean 10,000 and a variance of 10,000,000. Rearrangements were introduced sequentially in each
chromosome. For chromosomes in which BFB was simulated, consecutive rounds of BFB were introduced
after one of the fifty non-BFB rearrangements. The number of BFB rounds varied from two to ten. Each
BFB round consisted of a prefix of the chromosome undergoing a tandem inverted duplication. The
size of the prefix was selected from a normal distribution with a mean of zero and a standard deviation
of one tenth of the length of the chromosome.
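
A simulated BFB round of this kind can be sketched in Python. The sketch below makes several assumptions that are not stated in the text: the chromosome is modeled as a list of signed segment labels rather than individual bases, "bounded at zero" is read as rejecting negative draws, the prefix length is the absolute value of the normal draw clipped to the chromosome length, and the inverted copy is placed immediately before the prefix. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def sample_rearrangement_size(rng, mean=10_000, variance=10_000_000):
    """Normal draw bounded at zero (assumed here to mean rejection of negatives)."""
    sd = math.sqrt(variance)  # about 3162.3 for the stated variance
    while True:
        size = rng.gauss(mean, sd)
        if size >= 0:
            return int(size)


def sample_prefix_length(rng, chrom_len):
    """Prefix length from N(0, (chrom_len / 10)^2); abs() and clipping are assumptions."""
    return min(chrom_len, int(abs(rng.gauss(0, chrom_len / 10))))


def bfb_round(chrom, prefix_len):
    """Tandem inverted duplication of a prefix: the reversed, sign-flipped
    copy of the prefix is joined head-to-head with the original."""
    prefix = chrom[:prefix_len]
    inverted = [-seg for seg in reversed(prefix)]
    return inverted + chrom


def count_vector(chrom, k):
    """Copy number of each segment 1..k, ignoring orientation."""
    counts = Counter(abs(seg) for seg in chrom)
    return [counts[i] for i in range(1, k + 1)]


chrom = [1, 2, 3]
after_one = bfb_round(chrom, 2)      # [-2, -1, 1, 2, 3]
after_two = bfb_round(after_one, 3)  # [-1, 1, 2, -2, -1, 1, 2, 3]
```

Each round creates one head-to-head junction (a fold-back inversion) at the boundary between the inverted copy and the original prefix, and the resulting copy numbers are read off with `count_vector`.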

After each chromosome in the pair was rearranged, the copy numbers and breakpoints were combined 
as one would expect from experimental evidence. 



6. Cancer cell line results 

We identified count vectors on three chromosomes from the 746 cancer cell lines that were long and 
nearly consistent with BFB. The observed count vectors along with the nearest count vector consistent 
with BFB are shown below. 

Cell line: AU565    Tissue: bone
Chromosome 8 between 72.5 MB and 80.0 MB
Observed  4, 8, 14, 10, 8, 14, 9, 13, 7, 12, 9, 7
Fit       4, 8, 14, 10, 8, 14, 9, 13, 7, 13, 9, 7

Cell line: PC-3    Tissue: prostate
Chromosome 10 between 60 MB and 82 MB
Observed  6, 10, 14,  9, 6, 9, 13, 9, 5, 9, 3, 14
Fit       6, 10, 14, 10, 6, 9, 13, 9, 5, 9, 3, 15

Cell line: MG-63    Tissue: bone
Chromosome 8 between 112 MB and 121 MB
Observed  10, 6, 8, 14, 11, 14, 9, 8, 13, 9, 13, 9, 7
Fit       10, 6, 8, 14, 11, 15, 9, 9, 13, 9, 13, 9, 7
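
To make the differences between the observed and fitted vectors easy to see, the following Python sketch lists the positions at which they disagree. It is a plain element-wise comparison added for illustration; it is not the distance measure used in the paper, and the function name is hypothetical.

```python
# Observed and fitted count vectors, copied from the listing above.
observed_fit = {
    "AU565": ([4, 8, 14, 10, 8, 14, 9, 13, 7, 12, 9, 7],
              [4, 8, 14, 10, 8, 14, 9, 13, 7, 13, 9, 7]),
    "PC-3": ([6, 10, 14, 9, 6, 9, 13, 9, 5, 9, 3, 14],
             [6, 10, 14, 10, 6, 9, 13, 9, 5, 9, 3, 15]),
    "MG-63": ([10, 6, 8, 14, 11, 14, 9, 8, 13, 9, 13, 9, 7],
              [10, 6, 8, 14, 11, 15, 9, 9, 13, 9, 13, 9, 7]),
}


def differing_positions(observed, fit):
    """Return (1-based position, observed count, fitted count) for each mismatch."""
    return [(i, o, f)
            for i, (o, f) in enumerate(zip(observed, fit), start=1)
            if o != f]


diffs = {name: differing_positions(obs, fit)
         for name, (obs, fit) in observed_fit.items()}
```

In each case the fitted vector differs from the observed one in at most two segments, each by a single copy.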

7. ROC curves for varying simulation parameters

Below are the ROC curves, similar to those in Figure 4 of the main paper, for many different simulation 
and test parameters. 

Figure S1. ROC curves for simulations with 2 rounds of BFB. Clockwise from the
upper left: evendup background with no use of fold-back fraction, evendup background
using fold-backs, highdup background using fold-backs, highdup background with no use
of fold-back fraction.

Figure S2. ROC curves for simulations with 4 rounds of BFB. Clockwise from the
upper left: evendup background with no use of fold-back fraction, evendup background
using fold-backs, highdup background using fold-backs, highdup background with no use
of fold-back fraction.

Figure S3. ROC curves for simulations with 6 rounds of BFB. Clockwise from the
upper left: evendup background with no use of fold-back fraction, evendup background
using fold-backs, highdup background using fold-backs, highdup background with no use
of fold-back fraction.

Figure S4. ROC curves for simulations with 8 rounds of BFB. Clockwise from the
upper left: evendup background with no use of fold-back fraction, evendup background
using fold-backs, highdup background using fold-backs, highdup background with no use
of fold-back fraction.

Figure S5. ROC curves for simulations with 10 rounds of BFB. Clockwise from the
upper left: evendup background with no use of fold-back fraction, evendup background
using fold-backs, highdup background using fold-backs, highdup background with no use
of fold-back fraction.



8. Pancreatic cancer data analysis pipeline 
Figure S6 shows a graphical layout of the analysis. 

Figure S6. Graphical representation of the analysis performed with the pancreatic
cancer paired-end sequencing data.



9. Possible arrangement of segments on BFB-rearranged chromosome 12 

Figure S7. Plausible BFB cycles that could lead to the copy counts observed in
chromosome 12 of pancreatic cancer sample PD3641.



